Preface


Serendipitous discoveries are fascinating events of science inducing, at times, 
paradigm shifts that give rise to new disciplines tout-court. 

This is what happened with pangenomics: a novel discipline at the intersection of 
biology, computer science and applied mathematics, whose discovery, development 
to state of the art and future perspectives are tentatively collected in this book for the 
first time, 15 years after its inception. 

In simple terms, the pangenome concept is the realization that the genetic 
repertoire of a biological species, i.e. the pool of genetic material present across 
the organisms of the species, always exceeds each of the individual genomes and can 
be, in several cases, “unbounded”: an open pangenome. 

This notion was conceived in 2005 as an unexpected, data-driven outcome of the 
comparative analyses of a few bacterial genomes. This early example of big data in 
biology—in which a mathematical model, developed to address a practical question 
in vaccinology, transformed established concepts—opened biology to the 
unbounded. 

Since then, the advent of next-generation sequencing and computational technologies
has afforded the generation of pangenomes from thousands of isolates and
non-cultured samples of many microbial species, first, and then of eukaryotes
encompassing all the kingdoms of life, confirming and extending the original
hypothesis beyond the most ambitious expectations.

The first part of the book, Genomic diversity and the pangenome concept, opens 
with a historical account of the original discovery, the observed analogy between 
genomic sequences and text corpora that allowed the application of mathematical 
linguistics to the analysis of genomic diversity and the emergence of the pangenome 
concept in bacteria. 

In the second chapter, the reader will find an extensive introduction to the
biological species concept with its challenges, the processes associated with the
birth and development of a new species and the implications for its pangenome
limits.

The following chapter provides a perspective on genome plasticity, pangenome
size and functional diversity from the unique point of view of the bacterium itself,
followed in the last chapter of the section by a systematic review of the increasingly
sophisticated and performant bioinformatic pipelines that have been made available
to the scientific community, transforming pangenomics into a commodity tool for
the twenty-first century biologist.

The second part, Evolutionary biology of pangenomes, aims at making sense of 
pangenomics through the explanatory perspective of evolution. 

As Theodosius Dobzhansky attested half a century ago,¹ nothing in biology
makes sense except in the light of evolution. Pangenomes are no exception, as the 
genetic diversity observed in a species is the direct result of the evolutionary 
interplay between its member organisms and their environment. The effort is 
facilitated by the significant advances made in the last decade by mathematical 
modelling, systems theory and computational simulations, in an attempt to clarify 
the functional mechanisms underpinning diversity generation at the population level, 
especially in prokaryotes. 

The first chapter of this section² moves from the dynamic forces that shape
pangenome variations, particularly horizontal gene transfer, to discuss the
implications for population structures and their ecological significance.

The second chapter analyses the microevolution of bacterial populations 
by introducing a neutral phylogenetic framework open to the assessment of natural 
selection and discusses how to reconstruct the microevolutionary history of an 
entire pangenome. The relationship between pangenomes and selection is further 
explored in the following chapter, which proposes a stimulating view of 
pangenomics based on the economic theory of public goods, resulting in the 
hypothesis that pangenomes are constructed and maintained by niche adaptation. 
The section closes with a zoom into the alarming public health crisis of antimicrobial 
resistance, where the authors consider how the pangenome affects the response to 
antibiotics, the development of resistance and the role of the selective pressures 
induced by antibiotics and discuss how the pangenome paradigm can foster the 
development of effective therapies. 

The third part, Pangenomics: an open, evolving discipline, takes the reader on a 
journey through applications of pangenome approaches beyond just genes and 
sequences for prokaryotes and into the realm of eukaryotes. Indeed, as the 
pangenome concept evolves and genomes from multiple isolates/individuals within 
virtually all living species become available, it is important to study and challenge 
the concept beyond the primary genomic sequence and beyond the bacterial world. 
While most of the pangenome studies published to date focus on genes as the unit,
any sequence (e.g. promoter, intron, intergenic region and mobile element) could be
used as the unit to account for the many levels of variation and regulation governing
a population, including entire communities occupying a particular niche.

¹Theodosius Dobzhansky, The American Biology Teacher, Vol. 35 No. 3, March 1973
(pp. 125-129). DOI: https://doi.org/10.2307/4444260

²Contributed by the brave scholar who once told the late Prof. Stanley Falkow “this is simply
because, Stan, you don’t understand population biology” [Conference on “Microbial population
genomics: sequence, function and diversity”, Novartis Vaccines Research Center, Siena (Italy),
17-19 January, 2007].

The first chapter of section three provides a vision of how pangenome analyses 
can be applied to the study of multiple species within a community or microbiome 
and how outcomes will lead to the characterization of pan-metagenomes across 
niches or environments. The second chapter describes procedures to infer the 
biological impact of pangenomic diversity, translating it into functional pathways 
and their rendition as phenotypes, or panphenomes. The third chapter brings the 
additional layer of epigenetic regulation into the picture, describing modification 
processes, methods to detect them and their relationship with the pangenome. 
Finally, the application of pangenome studies to other kingdoms of life beyond 
bacteria is a natural extension of the concept. Chapter four provides a detailed 
overview of eukaryotic genome projects, their genome dynamics and associated 
pangenome analyses, while the fifth and last chapter of this book compares and
contrasts computational strategies that can be implemented towards the
characterization of eukaryotic pangenomes.

We hope that this book, thanks to the extraordinary quality of the contributions 
from each of the authors involved, will provide a broad readership of life scientists 
with a useful tool for getting acquainted with—or delving deeper into—the 
pangenome concept and its theoretical foundations, for getting up to speed with 
the latest technologies and applications of pangenomics, or simply to explore one of 
the most exciting novelties of twenty-first century biology. 

Should pangenomics continue to develop at the current pace, this volume would 
soon be outdated by the forthcoming developments, killed by its own success. 

However, we believe that the elements captured herein—the serendipitous 
dynamics of the data-driven discovery and the fundamental mindset shift, the 
understanding of the mechanisms through evolutionary biology, the perspectives 
and impacts of pangenomics for all kingdoms of life—might remain as a useful 
reference for the life science community in the years to come. 


Baltimore, MD, USA Hervé Tettelin 
Siena, Italy Duccio Medini 
Part I
Genomic Diversity and the Pangenome 
Concept 


The Pangenome: A Data-Driven Discovery
in Biology


Duccio Medini, Claudio Donati, Rino Rappuoli, and Hervé Tettelin 


Abstract An early example of big data in biology: how a mathematical model,
developed to address a practical question in vaccinology, transformed established
concepts, opening biology to the “unbounded.”


Keywords Pangenome · Heaps’ law · Reverse vaccinology · Group B
Streptococcus · Big data · Unbounded diversity


1 The Quest for a Streptococcus agalactiae Vaccine 


In August of 2000, a collaboration between Rino Rappuoli's team, including Duccio
Medini, Claudio Donati, and Antonello Covacci at Chiron Vaccines in Siena, Italy,
and Claire Fraser's group, including Hervé Tettelin at the Institute for Genomic
Research (TIGR) in Rockville, MD, USA, was established to apply their recently
pioneered reverse vaccinology approach (Pizza et al. 2000; Tettelin et al. 2000) to
the problem of neonatal Group B Streptococcus (GBS, or Streptococcus agalactiae)
infections (Fig. 1a). The collaboration also included Dennis Kasper, Michael
Wessels, and colleagues, experts in GBS biology from the Boston Children's
Hospital, Harvard Medical School, Boston, MA, USA.

GBS is a leading cause of neonatal life-threatening infections, despite the
extensive application of antibiotic prophylaxis. Therefore, a vaccine was dearly needed to
effectively prevent GBS infections. The manufacturing of a capsular polysaccharide-based
vaccine was hindered by the existence and high incidence of at least five
different disease-causing serotypes of GBS. Thus, the collaborative team embarked
on the development of a GBS protein-based vaccine.

Fig. 1 Pangenome visuals. (a) 1999, Plymouth (NH, USA): Rino and Hervé in the woods around
the time of initial discussions about the GBS collaboration. (b) 2004, Rockville (MD, USA):
Pangenome early sketch and (Hervé the) gnome in his pants. (c) Early 2005, Siena (Italy): Duccio
and Claudio labor over the pangenome formula development. (d) 2018, Ellicott City (MD, USA):
pangenome book editing, Hervé and Duccio locked in the basement

The concept was to use the Streptococcus agalactiae genome sequence information
to predict proteins likely to be surface exposed and use these in experimental
assays for antigenicity and antibody accessibility toward the development of a GBS
vaccine via active maternal immunization [for details on GBS reverse vaccinology,
see Maione et al. (2005)].

Unlike the case of Neisseria meningitidis, with which reverse vaccinology was
pioneered right before the GBS project using a single genome, two GBS gap-free
genomes were available when the project was initiated, and more genomes were
generated early in the course of the project. Indeed, Tettelin et al. [TIGR (Tettelin et al. 2002)]
and Glaser et al. [Pasteur Institute, France (Glaser et al. 2002)] independently reported the
first two complete gap-free genome sequences of GBS in September of 2002.

At that time, sequencing multiple strains or isolates of the same species was far from
commonplace. Both strains, serotype V 2603 V/R and serotype III NEM316, were clinical
isolates. Glaser et al. compared their NEM316 genome to that of Streptococcus pyogenes
(group A Streptococcus, GAS) and concluded that 50% of the GBS genes without an
ortholog in GAS were located in 14 potential pathogenicity islands enriched in genes
related to virulence and mobile elements. Tettelin et al. used a microarray-based
comparative genomic hybridization (CGH) approach, whereby they hybridized the genomic
DNA of each of 19 GBS isolates of various serotypes onto a microarray of spotted 2603
V/R gene-specific amplicons, and identified several regions of genomic diversity among
GBS isolates, including between isolates of the same serotype (see Fig. 2a).

These separate studies provided the first evidence that a significant amount of
genomic information or gene content was variable among closely related streptococcal
isolates, challenging the commonly accepted notion that the genome of a single isolate
of a given species was sufficient to represent the genomic content of that species. Based
on this understanding, the collaborative team decided to generate six additional GBS
genomes (Tettelin et al. 2005), selecting isolates from the five major disease-causing
serotypes known at the time. The genome of the serotype Ia strain A909 was sequenced
to completion in collaboration with the group of Craig Rubens at Children’s Hospital
and Regional Medical Center, Seattle, WA, USA. The other five strains, namely 515 (serotype
Ia), H36B (serotype Ib), 18RS21 (serotype II), COH1 (serotype III), and CJB111
(serotype V), were sequenced as draft genomes, i.e., no attempt was made to manually
close the gaps existing between contigs of the genome assemblies.¹ Comparison of the
eight GBS whole-genome sequences confirmed the presence of the regions of genomic
diversity previously identified by CGH (see Fig. 2b).

Surprisingly for the time, the shared backbone, or core set of genes present in 
each of the eight genomes, amounted to only about 80% of any individual genome's 
gene coding potential. Within these eight genomes, there was no pair that was nearly 
identical. Instead, each genome contributed a significant number of new strain- 
specific genes not present in any of the other genomes sequenced. Other sets of 
genes were shared by some but not all of the genomes. 

This large amount of genomic diversity, which was not correlated to GBS
serotypes, did not fail to stun members of the investigative team, including the experts in
GBS biology. It also prompted an important question that formed the foundation of
the pangenome concept: “How many genomes from isolates of the GBS species do
we need to sequence to be confident that we identified all of the genes that can be
harbored by GBS as a whole?”

This question, motivated by the need to identify all potential vaccine candidates for 
the species, led to active discussions among the collaborators, the drawing of highly 
accurate and inspirational scientific sketches (see Fig. 1b), and the decision to develop a 
mathematical model to determine how many other strains should have been sequenced. 


2 When Data Amount and Complexity Exceed What Can
Be Done Without Mathematics


The question was clear: “how many genomes...,” i.e., the answer had to be a
number. And a clear question is always a great way to start.


¹It should be noted that the COH1 genome, a representative of the highly prevalent disease-causing
CC17 clonal complex, was later released as a gap-free genome (NCBI BioProject: PRJEB5232).
Fig. 2 Group B Streptococcus (GBS) genome diversity data that led to the pangenome discovery.
(a) Comparative genome hybridization (CGH) provided a first hint about the high degree of
genomic diversity within the GBS species. This circular representation of the GBS 2603 V/R
genome shows predicted ORFs in the two outermost rims and those variable
(blue bars) or absent (red bars) in the 19 genomic DNAs hybridized onto the 2603 V/R gene
amplicon microarray. Regions of diversity are numbered 1-15 [for details, see Tettelin et al.
(2002)]. (b) In silico comparative genomic analysis of 8 GBS genomes confirmed CGH results and
revealed additional regions of diversity using each genome as a reference. In this display, genes are
arbitrarily color-coded by position in their genome along a gradient from yellow to blue. Genes are
then depicted above their ortholog in the reference genome using the color they have in their home
genome. Breaks in the color gradient reveal rearrangements and white regions reveal genomic
regions absent in query genomes when compared to the reference. Each panel corresponds to each
of the eight genomes used as the reference [for details, see Tettelin et al. (2005)]. Copyright 2002,
2005 National Academy of Sciences

When the team in Siena was asked to figure out how to come up with an answer,
they were faced with two assumptions, implicit in the question itself. First, the
number was expected to be larger than eight, as the presence of specific genes in
each of the eight isolates already sequenced suggested. Second, such a finite number
was expected to exist.

The whole concept of biological species, a cornerstone of classical cladistics
textbooks, had already been evolving toward the “species genome” concept thanks to the
genomic revolution. The common knowledge, though, still held a 1:1 relationship
between the species and the genome concepts. Consequently, a well-defined genetic
repertoire for a bacterial species was the most natural assumption, implying that a finite
(and hopefully small) number of genome sequences would be sufficient to exhaust it.

Genomic data had already introduced complexity and size in biology a decade
before, when substantial mathematical work had been required to succeed in
assembling tens of thousands of Sanger reads into a reconstructed chromosomal sequence
(Sutton et al. 1995).

Here complexity and size were growing again, as the population scale of a species
was being explored. More mathematical modeling was needed to translate the
comparison among genomic data into a number.

Any modeling work starts with arbitrary choices. The first choice, which would
remain a cornerstone of pangenome pipelines in the decades to come, was to adopt
a reference-free approach.

Population genomics had been explored to that point mostly through cDNA
microarrays (CGH), where the experimental design favors the physical comparison
of DNA from many isolates with a reference one, usually a well-known laboratory
strain used worldwide by the scientific community.

This approach also has benefits for in silico comparative genomics, because the
number of comparisons to be performed scales linearly with the number of genome
sequences to be compared, i.e., for any new isolate, one more comparison is
performed. Also, the high-quality annotation of a well-studied genome can be easily
transferred onto the others. However, the reference-based approach introduces
strong limitations, biasing the comparisons toward one specific individual of the
species, which usually has no other ecological merit than having been around in
microbiology labs for decades.
Looking for a holistic assessment of a species' diversity, the reference-free
approach was natural, but it came with the disadvantage of scaling quadratically,
i.e., any new genome would have to be compared to all the genomes already
considered, leading to significant computational challenges.²
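
The difference between the two scaling regimes can be made concrete with a minimal sketch. It assumes that one "comparison" means one whole-genome vs. whole-genome alignment and that, in the reference-free case, each unordered pair of genomes is compared once; the function names are illustrative and are not taken from any published pipeline.

```python
def reference_based_comparisons(n_genomes: int) -> int:
    """Comparisons needed when every genome is compared to one reference."""
    return max(n_genomes - 1, 0)


def reference_free_comparisons(n_genomes: int) -> int:
    """Comparisons needed when every unordered pair of genomes is compared."""
    return n_genomes * (n_genomes - 1) // 2


if __name__ == "__main__":
    # Linear vs. quadratic growth for a few dataset sizes.
    for n in (8, 100, 1000):
        print(n, reference_based_comparisons(n), reference_free_comparisons(n))
```

With eight genomes the gap is modest (7 vs. 28 comparisons), but it widens quickly as more isolates are added.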

The second modeling choice was to use the gene as a unit of comparison or, more 
precisely, the open reading frames (ORFs) bioinformatically predicted on each 
genome sequence. Consequently, the analysis focused on an arbitrary subset of the 
genetic material, ignoring noncoding sequences whose relevance would have been 
increasingly appreciated in the years to come. Also, it implied accepting a certain 
number of nucleotide-level polymorphisms as not relevant for the diversity they 
were trying to model: allelic variants of the same gene would be considered as the 
same entity, as the problem was not to characterize microevolution—that strains 
accumulate mutations was well known—but to quantify the amount of “novel” 
genetic material contributed by each new sequence. 

Intuitively, the more genomes analyzed, the fewer new genes (ORFs not observed
with sufficient similarity in any other genome) should be identified. To answer the
original question (“how many genomes...”) the team decided to determine the pace
at which new genes would decrease with increasing numbers of genomes sequenced,
in order to extrapolate the trend toward the number of genomes corresponding to no
new genes identified.

As the number of new genes identified in the n-th genome depends on the selection
of both the n-th genome itself and the previous n − 1 genomes considered, for each
n from 1 to 8 we considered all the 8!/[(n − 1)!·(8 − n)!] possible combinations to
avoid bias, i.e., a total of 1024 pairwise, whole-genome vs. whole-genome
comparisons, corresponding to ~2 billion gene vs. gene comparisons.
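
The count of 1024 follows directly from the formula above, as the short check below illustrates. It reads the expression as "choose the n-th genome, then choose the unordered set of n − 1 genomes that precede it"; the variable and function names are only illustrative.

```python
from math import factorial

N_GENOMES = 8  # genomes available in the original GBS analysis


def combinations_for_step(n: int, total: int = N_GENOMES) -> int:
    """Ways to pick the n-th genome and the set of n - 1 genomes preceding it."""
    return factorial(total) // (factorial(n - 1) * factorial(total - n))


per_step = [combinations_for_step(n) for n in range(1, N_GENOMES + 1)]
total_comparisons = sum(per_step)

print(per_step)           # [8, 56, 168, 280, 280, 168, 56, 8]
print(total_comparisons)  # 1024
```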

For each n from 1 to 8, we obtained a cloud of values and, following the same 
approach, the number of core genes (ORFs observed with sufficient similarity in all 
other genomes) was also measured. 

Both new and core gene averages showed the expected decreasing trends, with 
the number of core genes for GBS decreasing exponentially toward the asymptotic 
value of 1806. 

Surprisingly, though, the decreasing number of new genes was not trending toward 
zero in any way. Rather, the trend was reasonably reproduced by an exponential decay 
converging to a fixed value of 33, significantly greater than zero (see Fig. 3a). 

In summary, mathematical extrapolation of the trend observed with the first eight 
genomes indicated that, for every new genome sequenced, new genes would have 
been discovered, even after a large number of genomes had been sequenced. 

The extrapolation had two immediate implications: (i) no number of sequenced
genomes would have assured a complete sampling of the GBS species pangenome,
because (ii) the genetic repertoire of the species had to be considered as an
unbounded entity.
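
Why a non-zero asymptote implies an unbounded repertoire can be seen in a small numerical sketch. Only the asymptotic value of 33 new genes per genome comes from the text; the amplitude and decay constant below are hypothetical placeholders, and the functional form (a decaying exponential plus a constant) is an assumed reading of the fit described above, not the published parameterization.

```python
import math

ASYMPTOTE = 33.0   # from the text: new genes per genome for large n
AMPLITUDE = 600.0  # hypothetical, for illustration only
DECAY = 2.0        # hypothetical, for illustration only


def new_genes(n: int) -> float:
    """Expected new genes contributed by the n-th genome (assumed form)."""
    return AMPLITUDE * math.exp(-n / DECAY) + ASYMPTOTE


def cumulative_new_genes(n_max: int) -> float:
    """Genes added to the pangenome by genomes 2..n_max."""
    return sum(new_genes(n) for n in range(2, n_max + 1))


if __name__ == "__main__":
    # Once the exponential term has died out, each extra genome still adds
    # about ASYMPTOTE genes, so the total keeps growing roughly linearly.
    for n_max in (10, 100, 1000):
        print(n_max, round(cumulative_new_genes(n_max)))
```

Under these assumptions, every further block of 1000 genomes adds about 33,000 genes, so the cumulative total never levels off.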


²This would have been mitigated a few years later by the introduction of an unbiased, random
sampling adjustment (Tettelin et al. 2008).
Fig. 3 Mathematical models revealing the “unbounded” pangenome. (a) The first GBS pangenome
(Tettelin et al. 2005), copyright 2005 National Academy of Sciences. The number of specific genes
is plotted as a function of the number n of strains sequentially added. The blue curve is the
least-squares fit of the exponential pangenome function to the data. The extrapolated average number of
strain-specific genes is shown as a dashed line. (Inset) Size of the GBS pangenome as a function of
n. The red curve is the calculated pangenome size with values of the parameters obtained from the fit
of the pangenome function to data. (b) The refined power-law pangenome model (Tettelin et al.
2008). Pangenome of Bacillus cereus using medians and a power-law fit. The total number of genes
found with the pangenome analysis is shown for increasing numbers of genomes sequenced.
Medians of the distributions are indicated by red squares. The curve is a least-squares fit of the
power-law pangenome function to medians. The exponent γ > 0 indicates an open pangenome
species
Understandably, the conclusion elicited reactions within the group ranging from
complacent irony to the gentle suggestion to redo the work and find the mistake
(see Fig. 1c).

So the team did, adding different alignment algorithms, running accurate sensitivity
analyses on the thresholds adopted for sequence alignment, applying the same
pipeline to other bacterial species known to be less variable as a negative control and
rechecking every line of the code. Eventually, the team agreed that the extrapolation
was correct and novel genes belonging to the GBS species would be found even after 
sequencing a very large number of genomes. At this point, the team realized the need 
for a new entity in the genome world to account for those genes that belong to the 
species but are not present in some genomes. After long discussions, the team agreed 
on the pangenome concept and described the pangenome of each species by three 
differentiated components: its core genome, i.e., the genes present in each isolate of 
the species; its accessory genome, also called initially dispensable, i.e., the genes 
present in several but not all isolates; and, finally, its strain-specific genes, sampled 
in one isolate only. 
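
The three components can be expressed as a simple partition of gene presence counts. The sketch below is an illustration under stated assumptions: genes have already been grouped into families (here just string labels), at least two genomes are provided, and the strain and gene names are invented toy data rather than real GBS content.

```python
from collections import Counter


def partition_pangenome(genomes: dict[str, set[str]]):
    """Split gene families into core, accessory and strain-specific sets.

    Assumes at least two genomes; gene labels stand for families of ORFs
    already judged "sufficiently similar" by some upstream comparison.
    """
    n_genomes = len(genomes)
    presence = Counter(gene for genes in genomes.values() for gene in genes)
    core = {g for g, c in presence.items() if c == n_genomes}
    strain_specific = {g for g, c in presence.items() if c == 1}
    accessory = set(presence) - core - strain_specific
    return core, accessory, strain_specific


# Toy example with hypothetical strains and gene families.
toy_genomes = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g3", "g5"},
    "strain_C": {"g1", "g2", "g6"},
}
core, accessory, strain_specific = partition_pangenome(toy_genomes)
print(sorted(core), sorted(accessory), sorted(strain_specific))
```

In this toy case g1 and g2 are core, g3 is accessory, and g4, g5 and g6 are strain-specific; the pangenome is the union of the three sets.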

As it would become apparent a few years later, when more genome sequences 
became available, and for multiple species, a much more accurate description of the 
trend of new genes would have been provided by a power-law function (derived 
from the Heaps' law, see Fig. 3b) actually decreasing to zero, as described in more 
detail in the next section. 

But for S. agalactiae and some other species, the exponent of the power law was 
smaller than a critical value, i.e., the decrease of the number of new genes observed 
with new genomes was so slow, that the size of the pangenome remained an 
increasing and unbounded function of the number of genomes considered, as is 
the number of new words discovered in text corpora written in a live language 
(Heaps 1978). In other words, although the initial modeling work was still
incomplete, the conclusion was already correct.
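
The role of the critical exponent can be illustrated numerically. The sketch below assumes that the number of new genes in the N-th genome follows κ·N^(−α), with the open/closed threshold at α = 1 as stated in the caption of Fig. 5; the values of κ and α are arbitrary illustrations, not fitted parameters for any species.

```python
def new_genes(n_genome: int, kappa: float, alpha: float) -> float:
    """Assumed power law for new genes found in the N-th genome."""
    return kappa * n_genome ** (-alpha)


def pangenome_growth(n_genomes: int, kappa: float, alpha: float) -> float:
    """Genes accumulated by genomes 2..n_genomes under the power law."""
    return sum(new_genes(n, kappa, alpha) for n in range(2, n_genomes + 1))


if __name__ == "__main__":
    KAPPA = 100.0  # arbitrary scale factor
    for alpha in (0.5, 2.0):  # below and above the critical value of 1
        small = pangenome_growth(1_000, KAPPA, alpha)
        large = pangenome_growth(10_000, KAPPA, alpha)
        print(alpha, round(small, 1), round(large, 1))
```

With α = 0.5 the accumulated total keeps growing substantially between 1,000 and 10,000 genomes, whereas with α = 2 it has essentially saturated, even though the per-genome contribution tends to zero in both cases.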

Another critical element that would gain relevance over time in
pangenome analyses was the heterogeneity of the population sampling. As in any
population-modeling exercise, the conclusions at the population scale are heavily
dependent on the randomness of the sample, particularly if small, and can be
seriously affected by the presence of structure in the population. If only a few
related isolates were sequenced in an otherwise heterogeneous population, the
sample would underestimate the population's diversity. Conversely, if, in a population
characterized by a few groups of highly similar isolates, we assessed only
one genome per group, by extrapolating the measurement to the whole population
we would largely overestimate the overall diversity. An effective, albeit incomplete,
mitigation of the sampling bias was obtained by replacing the mean of the
permutations with medians, which are more robust indicators of centrality.
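
A toy calculation shows why the median is the more robust summary. The numbers below are invented for illustration: they mimic new-gene counts obtained from different permutations of genome order, with one permutation dominated by a highly divergent isolate.

```python
from statistics import mean, median

# Hypothetical new-gene counts across permutations; the last value stands
# for an outlier caused by one unusually divergent genome.
new_gene_counts = [30, 32, 35, 31, 33, 34, 240]

print(round(mean(new_gene_counts), 1))  # pulled upward by the outlier
print(median(new_gene_counts))          # stays close to the typical value
```

Here the mean is about 62 while the median is 33, so a fit based on medians is far less sensitive to a single atypical sample.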

However, in the original analysis of the eight S. agalactiae isolates, one of the
more surprising results for the experts in the species’ biology was the lack of any
specific relatedness among isolates belonging to the same serotypes, indicating that
the phenotypic criteria used to classify the species thus far had no direct relationship
with the genomic repertoire of the isolates. From a molecular perspective, this is
explained by the fact that genes encoding GBS capsular polysaccharides are part of a
single locus, and this locus can be transferred across isolates by lateral gene transfer,
showing how the repertoire of dispensable or strain-specific genes can, under
specific circumstances, become available to any strain of a given species.

All in all, the answer to the question “How many isolates do we need to sequence 
to identify all the GBS genes?" was: “there is no such number, the GBS pangenome 
is open.” 

The very idea of an unbounded genomic repertoire for a bacterial species was 
opening the microbiology community to a new way of looking at bacterial species 
and their anatomy. 

While the core-genome remains substantially stable after a few tens of isolates are 
properly sampled, confirming the genomic consistency of the bacterial species 
concept, the more isolates are sequenced the more strain-specific genes merge into 
the accessory genome, expanding the pangenome size. 

The underpinning mechanisms and ecological consequences of these dynamics of
novelty generation, spanning the scales of individual mutation, horizontal gene
transfer promoted by phage transduction, bacterial conjugation or natural
transformation, and population effects, would become the object of ever-increasing
attention from the scientific community in the years to come (see Fig. 4). A recent example was
the observation that the majority of the metabolic innovations in the evolution of
Escherichia coli arose through the horizontal transfer of single DNA segments (Pang
and Lercher 2019).


3 The Vocabulary of Life: Heaps’ Law and Pangenomes 


In the initial work on S. agalactiae (Tettelin et al. 2005), the authors used a 
decreasing exponential to model the number of new genes discovered in each new 
genome sequenced. This mathematical function converges asymptotically to a 
constant value (Fig. 3a and blue curves in Fig. 5). The openness of the pangenome 
followed from the fact that the best fit of the exponential function to data indicated an 
asymptotic value significantly higher than zero, i.e., a fixed number of new genes to 
be discovered in each new genome after the first eight sequenced. Although
supported by the biological diversity observed, such a conclusion was theoretically
disturbing because it indicated that, no matter how exhaustively the species would
have been sampled, the amount of novelty discovered per new isolate would have
remained, on average, constant: a possibility that is extremely unusual across a wide
variety of sampling problems.

In the subsequent work on H. influenzae (Hogg et al. 2007), the authors proposed
a different approach, focused on the frequency distribution of genes and on the more
conservative assumption of a mathematically closed pangenome. However, increasing
the number of genomes used to train their model led to a larger predicted
size of the pangenome, pointing again toward pangenome openness.

Fig. 5 Power-law regression for new genes (Tettelin et al. 2008). The numbers n of new genes are
plotted for increasing values of the number N of genomes sequenced. Medians of the distributions
are indicated by red squares. Blue curves are least-squares fit of the exponential function, as in the
original pangenome model. Red curves are least-squares fit of the power-law function. The
exponent α determines whether the pangenome is open (α < 1) or closed (α > 1). The top panel
shows data for an open pangenome species, P. marinus; the bottom panel for a closed pangenome
species, S. aureus


The collaboration with Ciro Cattuto from the Institute for Scientific Interchange
(ISI) Foundation in Turin offered the opportunity to recognize that determining the
size of a pangenome was a problem analogous to many similar sampling problems,
already addressed when dealing with macroscopic characteristics of complex
systems, including human languages.

Before delving into the analogy between genomics and linguistics that allowed the
pangenome problem to be solved mathematically, a short diversion into the origins of
the science of complex systems may be useful.

Starting in the 1970s, a few brilliant minds from disparate academic backgrounds
realized that the challenges and opportunities posed by contemporaneity bear a level of
complexity exceeding the capacity of established scientific paradigms (Ledford
2015). In 1984, a small group of Nobel laureates and eminent scientists from
physics, economics, and other disciplines founded the Santa Fe Institute (Santa Fe
Institute) with the visionary ambition of creating a novel science called complexity
(Waldrop 1993).

That original intuition is at the basis of today’s widespread concept of complex
system, adopted ubiquitously to deal with biological, ecological, economic,
technological, and societal systems that cannot be effectively described by linear, inductive
approaches, because of the nature of the interactions among the system’s components,
and between the system itself and its environment.

The inductive approach of empirical sciences (i) observes the detailed
phenomenology of a system to (ii) infer its underlying dynamics and (iii) uses the inferred
laws to describe deterministically the macroscopic properties of the system. For
example, (i) observe the movements of the Earth, the Moon, and the Sun to (ii) infer
the laws of gravitation and (iii) predict the future trajectory of the planets in the Solar
System (Newton 1687).

The approach proposed by the pioneers of complex systems was, in a way, the
opposite: (i) start by observing macroscopic, statistical properties shared by multiple
systems, (ii) identify a characteristic common to the disparate systems sharing the
same property, and (iii) infer generative models, based on that characteristic, capable
of accounting for the macroscopic properties observed. For example, (i) observe that
in social networks, such as Facebook, few individuals have many connections, and
many individuals have few connections, i.e., the frequency “y” of the degree “x” of
the network nodes follows a power law “y = x^α” for some value of the exponent
α; (ii) confirm that the frequency of words in human languages, of genes in
genomes and of inhabitants in cities, all share the same property described for social
networks, and all these systems are “modular”, i.e., composed of discrete, connected
elements; (iii) show that the “preferential attachment” mechanism, according to
which the more frequent an element is, the higher the likelihood that its frequency will
further increase, can be used to generate systems showing the power-law property
observed above (Albert and Barabasi 2002).
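
The preferential attachment mechanism described above can be illustrated with a short simulation. The Python sketch below is only a toy version written for this passage: it assumes that each new node makes a single link, and the function name, network size, and random seed are all arbitrary choices rather than anything taken from the cited work.

```python
import random
from collections import Counter

def preferential_attachment_degrees(n_nodes, seed=0):
    """Grow a network one node at a time and return the degree of every node.

    Each new node links to one existing node chosen with probability
    proportional to that node's current degree.
    """
    rng = random.Random(seed)
    # Every node appears in `endpoints` once per link it takes part in, so a
    # uniform draw from the list is a degree-proportional draw over nodes.
    endpoints = [0, 1]  # start from two nodes joined by a single link
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        endpoints.extend([new_node, target])
    return Counter(endpoints)

degrees = preferential_attachment_degrees(5000)
degree_histogram = Counter(degrees.values())  # degree x -> number of nodes y
for x in sorted(degree_histogram)[:5]:
    print(x, degree_histogram[x])
```

In such a run most nodes should end up with a single link while a handful of early nodes collect many, which is the qualitative "few with many, many with few" pattern of step (i).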

A similar thinking process led to the solution of the pangenome problem. The 
rapid accumulation of tens of genome sequences for multiple species had clearly 
shown that the number of new genes discovered per new genome sequenced follows 
a decreasing power law, rather than a decreasing exponential trend (see Fig. 5). A 
similar behavior, for the number of new words discovered upon analyzing increasing 
numbers of instance texts written in English, had been observed decades earlier by 
the mathematical linguist Gustav Herdan (1960) and then generalized by Harold
Stanley Heaps in the context of information theory as Heaps’ law (Heaps 1978).

When the number of new genes (or words) discovered is a power law of the
increasing number of genomes (or text corpora), the overall size of the pangenome
(or vocabulary) is also a power law, and the mathematical function depends only on
two parameters: the power-law exponent and a proportionality constant. The rate of
discovery of new genes is always predicted to decrease toward zero, but the speed of
the decrease varies by species. With open pangenomes, this rate simply does not
decrease fast enough for the cumulative number of observed genes to level off.
Thus, a power-law behavior for the observed number of specific genes allows the
possibility of having an open pangenome without requiring that a fixed number of
new genes be discovered for each new genome.
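
As a rough illustration of this argument, the Python sketch below fits a power law to the number of new genes found per added genome and then checks whether the cumulative count levels off. The functional form n(N) = kappa * N^(-alpha), the function names, and the synthetic numbers are assumptions made for this example only; they are not data or code from the original study.

```python
import math

def fit_power_law(genome_counts, new_genes):
    """Least-squares fit of new_genes = kappa * N**(-alpha) in log-log space."""
    xs = [math.log(n) for n in genome_counts]
    ys = [math.log(g) for g in new_genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    kappa = math.exp(mean_y - slope * mean_x)
    return kappa, -slope  # alpha is minus the log-log slope

def cumulative_new_genes(kappa, alpha, n_genomes):
    """Total number of new genes accumulated after n_genomes genomes."""
    return sum(kappa * n ** (-alpha) for n in range(1, n_genomes + 1))

# Synthetic (made-up) discovery curve: 300 new genes at N = 1, exponent 0.6.
genome_counts = list(range(1, 21))
new_genes = [300 * n ** -0.6 for n in genome_counts]
kappa, alpha = fit_power_law(genome_counts, new_genes)
print(f"kappa = {kappa:.1f}, alpha = {alpha:.2f}")
print("open" if alpha < 1 else "closed")

# With alpha < 1 the total keeps growing; with alpha > 1 it levels off.
for n in (10, 100, 1000):
    print(n, round(cumulative_new_genes(300, 0.6, n)),
          round(cumulative_new_genes(300, 2.0, n)))
```

The last loop makes the point of the paragraph concrete: in both cases the number of new genes per genome shrinks toward zero, yet only the steeper exponent yields a bounded total.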

In order to complete the approach proposed by the pioneers of complexity,
extensive work has been dedicated in recent years to the search for generative
models that would account for the macroscopic properties of pangenomes and
similar complex systems, including preferential attachment (Albert and Barabasi
2002), self-organized criticality (Bak et al. 1987; Mora and Bialek 2011), and
random group formation (Baek et al. 2011). Heaps’ law, however, is only one
of the properties displayed by genome data, the other two notable ones being
Zipf’s law for the frequency distribution of gene family sizes in complete genomes
(Huynen and van Nimwegen 1998) and the “U-shaped” gene frequency distribution
(Haegeman and Weitz 2012). The generative models proposed so far could generate
some of the observed macroscopic characteristics, but not all at the same time. More
recently, a novel mechanism based on a sample space-reducing process (Corominas-Murtra
et al. 2015) was proposed, and shown to reproduce naturally the three major
properties of pangenomes at once (Mazzolini et al. 2018). Generative processes
show how a certain system can be built (“generated”) following a pre-defined rule or
mechanism; for example, by choosing the elements of the system from an infinite
pool of possible components, one after the other randomly. The idea behind the
sample space-reducing process for the generation of a certain realization (genome,
book) is that when a component (gene, word) is chosen, that choice restricts the
space of the possible elements that can be chosen thereafter, permitting only certain
other components, but not all, to be added. This assumption seems particularly
relevant for genomic and linguistic systems, where the functioning (for genomes) or
meaningfulness (for texts) depends on ordered combinations of multiple elements
(genes in operons, words in sentences) that are not random (after a restriction
enzyme, only a methylation gene produces a restriction-modification system; after
a subject, only a verb produces a proposition). For this reason, and considering the
relative simplicity of its mathematical implementation, the sample space-reducing
process bears promise in the quest for a deeper understanding of the fundamental
mechanisms responsible for the generation of pangenomes.
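
The idea of a shrinking sample space can be made concrete with a minimal simulation. The Python sketch below uses the simplest "staircase" reading of such a process, in which states are numbered and each choice allows only lower-numbered states afterwards; this formulation, the function names, and the parameter values are assumptions for illustration and not the model applied to genomes in the cited work.

```python
import random
from collections import Counter

def ssr_run(n_states, rng):
    """One realization: start at the top state and keep jumping to a
    uniformly chosen lower state until state 1 is reached."""
    state = n_states
    visited = []
    while state > 1:
        state = rng.randint(1, state - 1)  # only lower states remain available
        visited.append(state)
    return visited

def visit_rates(n_states, n_runs, seed=0):
    """Fraction of runs in which each state below n_states is visited."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        counts.update(ssr_run(n_states, rng))
    return {state: counts[state] / n_runs for state in range(1, n_states)}

rates = visit_rates(n_states=50, n_runs=20000)
for state in (1, 2, 5, 10):
    print(state, round(rates[state], 3), round(1 / state, 3))
```

In this simplified setting the chance that a run visits state i is 1/i, so the printed simulated rates should track the second printed column: a Zipf-like law emerges from the restriction rule alone, without any tuning.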


4 Pangenome Vaccinology 


The existence of species with an open pangenome has a profound effect on the
selection of potential vaccine candidates identified by a reverse vaccinology 
approach. Indeed, the accessory genome was found to be an important contributor 
to protein antigens (Mora et al. 2006) implying that, for many bacterial species, a 
protein-based universal vaccine would only be possible by including a combination 
of antigens from the core and the accessory genomes. 

The pathogen population structure and dynamics became a key element of 
vaccine research, paving the way for a modern approach to vaccine discovery 
known as pangenomic reverse vaccinology (Donati et al. 2010; Mora et al. 2006;
Budroni et al. 2011). The key principles of this approach, which expands the reverse
vaccinology paradigm based on a single genome sequence (Rappuoli 2000), include
reducing bias in isolate selection for genome sequencing (to the extent possible, e.g., 
carriage vs. invasive isolates, or commensal vs. pathogenic isolates) based on 
epidemiology, followed by defining the population genomic structure of the species, 
including its pangenome. 

Reverse vaccinology pipelines are then applied to predict the antigenic potential
of proteins based on a collection of desired (and undesired) features that they carry;
for a recent review on reverse vaccinology pipelines, see Dalsass et al. (2019).
Top-ranked vaccine candidate proteins can then be taken through the experimental
portion of the vaccine development phase, whereby their accessibility and antigenicity
are assayed, for instance starting with antigen-based serological typing [for a
review on this experimental phase and subsequent phases, see Del Tordello
et al. (2017)].

It should be noted that the actual transcription, translation, and exposure of a set 
of selected vaccine protein candidates may vary with the environment, including 
colonization or infection of various organs, and niches within these organs. The 
pangenome can inform on these specificities by including isolates with a propensity 
to target certain organs/niches vs. others (e.g., skin vs. throat isolates of group A 
Streptococcus). Interactions of antigens with host moieties are also key to designing 
successful candidates. 

Ultimately, a combination of pangenomic reverse vaccinology with other multi- 
omics approaches in the context of host-pathogen interactions will better inform the 
rational design of next-generation vaccine targets and will lead to the most promising 
formulations to test in vivo. 


5 Discussion 


In 1946, even before the discovery of the structure of DNA (Watson and Crick
1953), Joshua Lederberg and Edward L. Tatum had demonstrated the existence of
“sexual” genetic recombination in bacteria (Lederberg and Tatum 1946), a discovery
that earned them the Nobel Prize in 1958.

Horizontal DNA exchange in bacteria has been the subject of intense research
ever since, so by the time the pangenome came along, the concept of diversity in
bacterial species had been ingrained in the scientific community for half a century 
already. However, possibly because of the hubris induced by the breakthrough of 
first-generation genome sequencing technologies, for more than a decade the com- 
munity had inadvertently reverted to a “pre-Lederberg-Tatum” mindset, considering 
each of the few genomes generated at the time as representatives of the respective 
species' genetic blueprint. Of note, the name pangenome came to life after many
long discussions on how to name this new concept, possibly reflecting the paradigm
shift that was required, at that time, to recognize the simple evidence of facts.

Fig. 6 Pangenome bibliometric data: overall number of pangenome publications since 2005. Blue
curve: Number of pangenome publications in PubMed (https://www.ncbi.nlm.nih.gov/pubmed).
Query: “pan-genome” [Title/Abstract] OR “pangenome” [Title/Abstract] OR “pan genome” [Title/Abstract].
Orange curve: Number of scientific publications in Google Scholar (https://scholar.google.it)
mentioning the pangenome. Query: Exact match to anywhere in the text: “pan-genome”
OR “pangenome” OR “pan genome”


The pangenome discovery, at first sight, brought the scientific community back to 
Earth, to realize that a single genome was far from describing a whole species and 
that, as a side consequence particularly relevant for the genome pioneers of the time, 
genome sequencing was there to stay as a flourishing business for decades. 

At the same time, though, the pangenome introduced a new dimension in 
microbiology that could hardly be associated with already established awareness: 
the concept that the genetic repertoire of a defined biological entity, such as a 
species, could be unbounded. In a way, the pangenome introduced the infinite in 
biology, with some humble analogy to what Theodosius Dobzhansky had done 
30 years before, much more fundamentally, through the explanatory light of evolu- 
tion (Dobzhansky 1973). 

This could partly explain why, over a relatively short period of time, 
pangenomics became a discipline in itself (see Fig. 6). From a more practical stance, 
the impressive acceleration of bacterial genome sequencing was generating high 
numbers of genes that would not map to species’ reference genomes. The new 
concept offered a conceptual framework to accommodate the wealth of new data,
rapidly becoming a must-have for any microbial sequencing project.

Thirteen years later (see Fig. 1d), however, two further elements could be
identified that contributed to transforming a specific, empirical question into a
discovery that opened the scientific community to a new research field.

First, a concrete challenge motivated by a burning, unmet medical need had
brought together people with very different backgrounds, spanning from Biology
and Medicine to Physics and Engineering. This collision model, extensively used in
modern science and business, promotes ideas that challenge the status quo by
facilitating cross-fertilization and lateral thinking. Questioning the serotypes, the
team discovered the pangenome: a finding that is simple in hindsight, yet one that
challenged the established paradigm of biological species.

Second, pioneering technological breakthroughs at the bleeding edge, as was the case
for genome sequencing and assembly at the time, frequently unveil new, unexpected
horizons. Not always, though: a critical condition remains the osmotic
collaboration between scientists and the technology experts mastering the data generation
process, which brings to the team an awareness of the limitations intrinsic to the data,
reducing the risk of hasty misinterpretations, as well as the frustration of missed
opportunities.

In conclusion, the pangenome is an early example of mathematical modeling
applied to biological big data: a serendipitous, data-driven discovery from a human
health challenge, fostered by technological breakthroughs and people with different 
backgrounds willing to challenge the status quo. We are deeply grateful to the many 
investigators worldwide who took the pangenome concept well beyond what could 
be envisioned at the time, perfected and expanded techniques and applications, and 
ignited the fascinating evolution of discoveries that the reader now has the opportunity
to explore in the remainder of this book.


The Prokaryotic Species Concept


Abstract Species constitute the fundamental units of taxonomy and an ideal species
definition would embody groups of genetically cohesive organisms reflecting their 
shared history, traits, and ecology. In contrast to animals and plants, where genetic 
cohesion can essentially be characterized by sexual compatibility and population 
structure, building a biologically relevant species definition remains a challenging 
endeavor in prokaryotes. Indeed, the structure, ecology, and dynamics of microbial 
populations are still largely enigmatic, and many aspects of prokaryotic genomics
deviate from those of sexual organisms. In this chapter, I present the main concepts and
operational definitions commonly used to designate microbial species. I further
emphasize how these different concepts accommodate the idiosyncrasies of prokaryotic
genomics, in particular, the existence of a core- and a pangenome. Although
prokaryote genomics is undoubtedly different from that of animals and plants, there is
growing evidence that gene flow—similar to sexual reproduction—plays a signifi- 
cant role in shaping the genomic cohesiveness of microbial populations, suggesting 
that, to some extent, a species definition based on the Biological Species Concept is 
applicable to prokaryotes. Building a satisfying species definition remains to be 
accomplished, but the integration of genomic data, ecology, and bioinformatics 
tools has expanded our comprehension of prokaryotic populations and their
dynamics.


1 The Bacterial Species Challenge


Are There Bacterial Species? The taxonomy of microorganisms has been delayed 
relative to macroscopic organisms, due in part to technical reasons. Evolutionary 
biologists and population geneticists originally focused their work on animals
and plants, which typically engage in sexual reproduction. For these organisms, 
speciation mechanisms involve—directly or indirectly—the sustained interruption 
of gene flow between populations (Dobzhansky 1935; Mayr 1942). The maintenance 
of gene flow warrants the genetic cohesion of populations, but because prokaryotes 
do not engage in sexual reproduction stricto sensu, the definition of species has been 
more elusive in bacteria. It has even been suggested that bacteria cannot and need not 
be organized into species, but rather represent a series of organisms with different 
levels of divergence from one another reflecting their past history (Doolittle and
Zhaxybayeva 2009; Bapteste et al. 2009). In other words, this view suggests that 
imposing a grouping of bacteria into species would be purely arbitrary and 
unreflective of any biologically-relevant process (e.g., cessation of gene flow). 
However, in practice, microbiologists can usually recognize and designate bacterial 
isolates based on their different phenotypic characteristics, and comparisons of 
bacterial genomes indicate that bacteria form clear clusters of highly related individ- 
uals, instead of showing a scattered distribution (Riley and Lizotte-Waniewski 2009; 
Caro-Quintero and Konstantinidis 2012; Konstantinidis et al. 2017), suggesting that 
they can be organized into species. Ecologically, bacteria can also be identified and 
clustered based on shared niches and properties (Shapiro and Polz 2014). Altogether, 
these observations indicate that bacteria can clearly be grouped into genetically and 
ecologically cohesive entities characteristic of "species", although such species might 
not be defined based on the same criteria as for sexual organisms. The bacterial 
species challenge aims to determine the processes that are shaping and maintaining 
these clusters of cohesive entities. 


Bacterial Genomics and the Case of Escherichia coli Before the advent of 
genotyping methods, microbiologists had to rely exclusively on phenotypic traits to 
characterize and classify bacteria. Such phenotypic observations offer one criterion 
for building a species concept, similar to the early approaches used by naturalists to 
classify animals and plants. However, these early observations showed that it might 
not be that simple. The seminal work of Oswald Avery and colleagues had strong
implications for the field of biology by showing that DNA, not protein, is the
carrier of heredity (Avery et al. 1944). But this experiment and earlier ones
further demonstrated that some phenotypic traits could be transmitted horizontally
from one bacterial cell to another (Griffith 1928). Although it took several decades to
fully understand the extent of horizontal gene transfer in bacteria, this challenging 
observation contrasted with animals and plants where traits are almost exclusively 
inherited vertically (i.e., from parent to offspring), indicating that something about 
bacteria was profoundly different. The development of genetic and genomic tech- 
niques further revealed how deeply bacterial genomics differed from animals and 
plants: related bacteria can differ dramatically in their gene contents and what is
typically considered as a bacterial species presents a set of ubiquitous and highly
similar genes, the core-genome, but also a set of accessory genes (also called 
dispensable, flexible, or auxiliary genes) presenting a scattered distribution (Vernikos 
et al. 2015). The pangenome represents the total gene diversity of a given population: 
this comprises the total number of distinct orthologs, including core genes and 
accessory genes (Tettelin et al. 2005; Medini et al. 2005; Vernikos et al. 2015). 
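
These definitions translate directly into set operations. The short Python sketch below is a hypothetical illustration: the strain names and gene-family identifiers are invented, and it assumes that orthologous genes have already been grouped into families, which is the difficult step in practice.

```python
def summarize_pangenome(strain_genes):
    """Split the genes of a set of strains into core, accessory, and pangenome."""
    gene_sets = [set(genes) for genes in strain_genes.values()]
    pangenome = set.union(*gene_sets)     # every distinct gene family
    core = set.intersection(*gene_sets)   # present in all strains
    accessory = pangenome - core          # missing from at least one strain
    return core, accessory, pangenome

# Hypothetical gene-family identifiers for three made-up strains.
strain_genes = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g5"},
    "strain_C": {"g1", "g2", "g3", "g6", "g7"},
}
core, accessory, pangenome = summarize_pangenome(strain_genes)
print(len(core), len(accessory), len(pangenome))
```

Adding a strain to the dictionary can only shrink the core set and enlarge the pangenome set, which is why both quantities depend on how many genomes have been sampled.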

The bacterium Escherichia coli perfectly illustrates the genomic versatility of prokaryotes.
The model strain E. coli K12 MG1655 contains approximately 4400 genes
(Hayashi et al. 2006), but other strains contain up to an additional 1000
genes encoding a variety of functions (Hayashi et al. 2001). The comparison of
only 20 strains of E. coli shows that the set of genes shared by all strains (the
core-genome) is composed of approximately 2000 genes, but its pangenome
readily approaches 18,000 genes (Touchon et al. 2009) and the inclusion of additional
strains would necessarily increase this number, as suggested by resampling
analyses (Touchon et al. 2009). These numbers indicate that over 50% of the genes
of a single strain of E. coli are accessory genes that lack orthologs in
the majority of other strains. Importantly, most of these accessory genes are
typically restricted to a single or a small subset of strains, but are often exchanged 
between strains (Groisman and Ochman 1996; Gogarten et al. 2002; Touchon et al. 
2009). Many strains of E. coli possess different lifestyles and ecologies broadly 
ranging from environmental to commensal or pathogenic and these differences can 
be primarily ascribed to their specific sets of accessory genes (Luo et al. 2011). For 
example, virulence genes represent a category of extensively studied accessory 
genes and they appear to be frequently exchanged during E. coli’s evolution 
(Groisman and Ochman 1996; Gogarten et al. 2002). 

Although E. coli strains present different phenotypes and many different assem- 
blages of accessory genes, they still form a cohesive entity since they share a large 
number of core genes that are highly similar between all strains of E. coli (typically 
>98% of sequence identity) (Bobay et al. 2013). This situation is problematic for 
applying phenotype-based classifications in microbiology, as emphasized by the case 
of Shigella. This bacterial “genus” comprises four recognized species (i.e., S. flexneri, 
S. boydii, S. sonnei, and S. dysenteriae), which have been grouped based on shared 
phenotypic properties (i.e., they are obligate pathogens) (Rolland et al. 1998; Pupo 
et al. 2000; Escobar-Paramo et al. 2003). However, genomic analyses showed that 
Shigella possesses the same core-genome as E. coli with an average of >98% of 
sequence identity across core genes and core-genome phylogenies revealed that 
Shigella do not form a monophyletic clade (Touchon et al. 2009). What unites 
Shigella together is the presence of shared virulence genes (Buchrieser et al. 2000; 
Touchon et al. 2009), their serology, and their incapacity to ferment lactose or 
decarboxylate lysine (Hale and Keusch 1996). In other words, Shigella constitutes 
a subset of E. coli strains with a shared phenotype conferred by the independent gain
of a common set of accessory genes by horizontal gene transfer. It is now recognized 
that Shigella are part of the E. coli species, but its taxonomy has not been revised. This 
example illustrates that the pangenome and its evolutionary dynamics represent a 
challenge to disentangling the complex relationship between phenotypes, ecology, 
and genomics in bacteria and how these characteristics correlate with taxonomy.


2 Species Concepts and Operational Definitions


Pragmatic Approaches: Sequence Thresholds One of the goals of a taxonomy is to 
facilitate communication in the scientific community. To satisfy the need for a
coherent microbial taxonomy, pragmatic approaches have been developed in order
to define species based on genetic or genomic similarities. Although this does not
directly offer insight into how and why a given set of strains constitutes a species, a
threshold-based method provides a convenient means to classify strains and revise
taxonomy as more comparative genomic data become available. Because these approaches
lack a theoretical framework, such threshold-based methods are often
said to define Operational Taxonomic Units (OTUs) rather than “species” to emphasize
that this is only an operational definition.

Before the rise of the genomic era, species membership was established by shared 
phenotypic traits and by DNA-DNA hybridization assays, which consist of comparing
a newly isolated strain to a reference strain (Brenner et al. 2000) (note that
other criteria such as GC content were also considered). The recommended threshold 
to define species membership was set at 70% of genomic hybridization to the 
reference strain (Brenner et al. 2000). The emergence of sequencing technologies 
led to the rise of related approaches. The 16S rRNA subunit has been identified as a 
universal gene shared by all bacteria and archaea (Woese and Fox 1977) offering the 
possibility to assess prokaryotic species membership with the same gene marker 
across all lineages. Analyses revealed that the threshold of 70% identity based on
DNA-DNA hybridization assays corresponds approximately to a threshold of 97% 
identity when using the 16S rRNA subunit (Stackebrandt and Goebel 1994; Ludwig 
and Klenk 2000; Richter and Rossello-Mora 2009). The use of 16S rRNA thresholds 
can be applied with ease and allows for the identification of a species by sequencing 
a single locus. OTU-typing based on the 16S rRNA gene became even more popular 
with the rise of metagenomic sequencing, where the amplification and sequencing of 
a fragment of the 16S rRNA gene provides a direct overview of the taxonomic 
diversity of a given sample without the need to cultivate any of its members. A
more recent approach consists of using the entire genome of a strain to calculate the 
Average Nucleotide Identity (ANI) across all the genes relative to a reference 
genome of the species (Konstantinidis and Tiedje 2005; Richter and Rossello- 
Mora 2009). Because protein-coding genes are not as selectively constrained as 
the 16S rRNA subunit, the ANI threshold used to attain species membership has 
been empirically defined as 95% based on correlations with the 16S sequence threshold
used to define species (Konstantinidis and Tiedje 2005; Richter and Rossello-Mora 
2009). Considering complete genomes obviously offers a more accurate resolution 
of sequence divergence. 
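
The logic of an ANI threshold can be sketched in a few lines of Python. This is a deliberately simplified illustration: it assumes that orthologous genes have already been paired and aligned to equal length, whereas real ANI tools perform the alignment themselves, and the sequences and function names are invented for the example.

```python
ANI_SPECIES_THRESHOLD = 0.95  # empirical cutoff quoted in the text

def fraction_identity(seq_a, seq_b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)

def average_nucleotide_identity(aligned_gene_pairs):
    """Mean identity over all shared genes between a query and a reference."""
    identities = [fraction_identity(a, b) for a, b in aligned_gene_pairs]
    return sum(identities) / len(identities)

def same_species(ani, threshold=ANI_SPECIES_THRESHOLD):
    return ani >= threshold

# Two made-up shared genes: one identical, one with a single mismatch in ten.
pairs = [("ACGTACGTAC", "ACGTACGTAC"), ("ACGTACGTAC", "ACGTACGTAT")]
ani = average_nucleotide_identity(pairs)
print(round(ani, 3))
```

The same comparison with a 97% cutoff applied to a single 16S rRNA alignment would give the single-locus version of the test.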

Sequence thresholds based on single loci or entire genomes present the advantage
of defining all prokaryotic species under a standardized framework, but, despite their
simplicity, they suffer from several technical difficulties. Sequences of the 16S rRNA
subunit evolve very slowly and thus sequences from related strains or species
typically display little or no informative differences (Kettler et al. 2007). Moreover,
multiple copies of the 16S rRNA gene are frequently found in the same genome and
they sometimes exhibit different levels of divergence (Acinas et al. 2004). In several 
cases, the different 16S rRNA copies present in the same genome can display 
remarkable levels of divergence, such as Thermoanaerobacter tengcongensis, 
which presents 11.6% of sequence divergence between its most different 16S 
rRNA copies (Acinas et al. 2004). Comparing these sequences would lead to the 
ironic conclusion that the same bacterial isolate should be classified into two distinct 
species. A more common criticism against 16S rRNA thresholds is that the diver- 
gence of the 16S rRNA gene does not always accurately reflect overall genomic 
divergence. For instance, the marine bacterium Prochlorococcus can be classified as 
a single species based on 16S rRNA sequences but some strains display only 66% 
genome-wide identity based on ANI methods (Zhaxybayeva et al. 2009). ANI 
thresholds are recognized as much more reliable criteria to define species and 16S 
rRNA alone is of little taxonomic value when complete genome sequences are 
available (Richter and Rossello-Mora 2009). However, ANI-based methods also
suffer from inconsistencies. Sequence identity might not be constant along the entire
genome (Retchless and Lawrence 2007, 2010) and the identity thresholds used to 
infer gene orthology can therefore affect the overall ANI value. Perhaps more 
importantly, ANI metrics are frequently computed against a single reference genome 
to assess species membership, but the choice of reference genomes is largely 
arbitrary and historically contingent. In other words, species borders can vary 
depending on which—or how many—genomes are used as a reference. Finally, 
using a fixed sequence threshold does not account for the different rates of genomic 
evolution across phyla (Hugenholtz et al. 2016), which are dictated by parameters 
like mutation rates, selection coefficients, and effective population sizes (Shapiro 
2014) that vary across prokaryotic lineages. Other mechanisms might further lead to 
differential rates of evolution such as the lack of DNA repair systems (Dorer et al. 
2011). Bacterial endosymbionts notoriously evolve at faster rates due to less effec- 
tive selective pressures imposed by their reduced population sizes (Moran 1996; 
Moran et al. 2009). As a consequence, the sequence threshold constituting a species 
in symbiotic bacteria likely corresponds to a different time scale in free-living 
bacteria (Parks et al. 2018). As a result of all these issues, applying sequence 
thresholds to define species is convenient but does not anchor a bacterial species 
concept on a solid theoretical framework. 
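
The dependence of species borders on the choice of reference genome can be shown with a toy example. In the Python sketch below, the three genomes and their pairwise ANI values are hypothetical numbers chosen so that the relation "at least 95% identical" is not transitive; nothing here comes from real data.

```python
ANI_THRESHOLD = 0.95

# Hypothetical pairwise ANI values for three made-up genomes.
pairwise_ani = {
    frozenset({"A", "B"}): 0.96,
    frozenset({"B", "C"}): 0.96,
    frozenset({"A", "C"}): 0.93,
}

def species_members(reference, genomes, ani=pairwise_ani, threshold=ANI_THRESHOLD):
    """Genomes assigned to the species when `reference` defines it."""
    members = {reference}
    for genome in genomes:
        if genome != reference and ani[frozenset({reference, genome})] >= threshold:
            members.add(genome)
    return members

genomes = ["A", "B", "C"]
for reference in genomes:
    print(reference, sorted(species_members(reference, genomes)))
```

With B as the reference all three genomes fall into one species, whereas with A or C as the reference one genome is excluded, so the same data support different species borders.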


Phylogenetic Concept Phylogenetic approaches offer another means to classify 
species. As for sequence thresholds, phylogenetic methods are also a pragmatic 
approach to define species, although phylogenetic species are defined in the context 
of evolutionary history (De Queiroz and Gauthier 1994). Besides taking sequence 
divergence into account, phylogenies typically require species and other taxa to 
constitute monophyletic groups. Although monophyly is usually a
key feature sought by phylogenetic approaches, it has been argued that exclusivity
might be preferable over monophyly (Velasco 2009; Wright and Baum 2018).
Exclusive groups are groups of strains/taxa that are more related to one another
than to other groups, without being necessarily monophyletic (Velasco 2009; Wright
and Baum 2018). A recent study focusing on Streptomycetaceae and Bacillus found
that exclusive clades can be defined for these taxa, although no objective threshold
appears universal (Wright and Baum 2018). An additional and nontrivial advantage
of phylogenetic methods is that they can inform other levels of relationships (e.g.,
genus and family) and are not restricted to delimiting species. Multiple genome-based
phylogenies have been constructed for taxonomic purposes (Garrity 2016;
Hugenholtz et al. 2016; Yoon et al. 2017; Parks et al. 2018) and offer a more accurate 
resolution than 16S rRNA phylogenies (Brochier et al. 2005; Ciccarelli et al. 2006; 
Thiergart et al. 2014). Akin to sequence thresholds, phylogenetic approaches fre- 
quently rely on a single threshold (e.g., a phylogenetic distance) to define species, 
but recently, a new approach has been developed to reclassify all prokaryotic 
organisms, while correcting for the uneven evolutionary rates across the tree 
(Parks et al. 2018). Such approaches offer a universal framework to classify spe- 
cies—and other taxonomic ranks—across the Tree of Life, while correcting for 
uneven rates of evolution (i.e., defining species with lineage-specific thresholds). 
The application of these approaches is much more cumbersome than 16S and ANI 
thresholds, but online tools and resources to place newly sequenced genomes in a 
reference phylogenetic tree are now available (Parks et al. 2018). The development 
of such tools and the maintenance of online resources offer the possibility to classify 
all prokaryotic genomes with ease into a single phylogenetic framework. Although 
phylogenetic methods offer many advantages over sequence threshold methods, they
also require comprehensive taxon sampling and can be affected by the underlying
phylogenetic model used to reconstruct the tree. Finally, a phylogenetic species
concept is still based on ad hoc criteria and does not aim to identify species
based on an explicit speciation model.


The Stable Ecotype Model The stable ecotype model (SEM) is a theoretical 
framework of bacterial evolution, upon which a microbial species concept can be 
founded (Cohan 2001; Wiedenbeck and Cohan 2011). In a world without sex, new 
beneficial alleles can only reach fixation through a genome sweep (i.e., fixation of the
entire genotype). Therefore, the competition of different bacterial strains for the 
same resources (the same niche) would lead periodically to the fixation of a single 
genotype. This model of periodic selection implies that most of the diversity of a 
species is periodically erased, thereby maintaining genetically cohesive entities, i.e., 
species. Thus, the SEM has the capacity to explain why bacteria form clusters of 
genomically similar entities. Under this framework, speciation is expected to occur 
when one strain gains the ability to colonize a different niche (Wiedenbeck and 
Cohan 2011). By colonizing a different niche, this new population would stop 
competing against the original population and would not be lost by the periodic 
selection of a successful genotype of the original population. Note that from the 
bacterial point of view, a new niche could be as simple as the presence of a new type 
of carbohydrate and multiple niches are expected to overlap in nature. 

A theoretical difficulty of the SEM became apparent when comparing the gene 
content of bacteria. It became clear that the gene content of a single strain typically 
represents a very small fraction of the total gene repertoire of the species (i.e., the 
pangenome) (Tettelin et al. 2005; Medini et al. 2005; Vernikos et al. 2015). This 
implies that the genetic cohesion of microbial species is only true for a restricted 
fraction of their genes: their core-genome (Lapierre and Gogarten 2009). The 
scattered distribution of various accessory genes across strains sharing a highly 
conserved core-genome cannot be easily reconciled with the SEM. Although a 
substantial fraction of the pangenome corresponds to mobile elements (Bobay et al. 
2013), accessory genes often contribute to the colonization of different niches 
(Ochman et al. 2000), which implies that the gain and loss of these genes can 
give a strain the capacity to colonize a new niche. This would lead to the 
disturbing conclusion that a given strain could frequently change species membership 
by gaining or losing specific sets of accessory genes. Because each genotype virtually 
contains its own set of accessory genes, each strain could be ascribed to a different 
ecotype and could be viewed as its own species (Doolittle and Zhaxybayeva 2009; 
Wiedenbeck and Cohan 2011). This extreme scenario, however, would fail to explain 
why many bacterial strains present a nearly identical core-genome. 

Although the SEM does not easily accommodate the large diversity of accessory 
genes observed in related bacteria, it has been argued that the definition of an ecotype 
could be more flexible by encompassing multiple sub-niches (the “nano niche” 
model) (Wiedenbeck and Cohan 2011). Some strains of a community can acquire 
alleles or accessory genes specialized in a sub-niche, while remaining part of a 
broader ecologically-cohesive entity. These specialized strains within an ecotype can 
be perceived as new species in the making. Nascent speciation might be constantly 
occurring but need not lead to full speciation (Shapiro and Polz 2014) and this could 
potentially explain the vast pangenome diversity in bacterial species. Alternative 
mechanisms have been hypothesized to explain the extensive gene diversity within 
ecotypes such as a high turnover of accessory genes (Doolittle and Papke 2006) or 
ecological processes maintaining bacterial diversity such as phage predation (“kill 
the winner” hypothesis) (Rodriguez-Valera et al. 2009; Thingstad and Lignell 1997) 
or negative frequency-dependent selection (Cordero and Polz 2014). 

While the SEM and related models could provide a coherent explanation of the 
observation of genomic clusters in the bacterial world—or at least their core- 
genomes—few results have reported genome sweeps as predicted by the periodic 
selection expected under the SEM. Multiple studies have overwhelmingly observed 
that gene sweeps rather than genome sweeps tend to occur under natural conditions 
(Simmons et al. 2008; Shapiro et al. 2012; Cadillot-Quiroz et al. 2012; Bendall et al. 
2016). These results contradict one assumption made by the ecotype model: recom- 
bination is negligible relative to selection. Evidence of homologous recombination 
has been reported for the vast majority of analyzed prokaryotic species (Vos and 
Didelot 2009; Bobay and Ochman 2017a). That some evidence of homologous 
recombination exists for most species does not necessarily imply that the rates of 
homologous recombination are high enough to counteract genome sweeps. A more 
pertinent metric consists of comparing recombination rate relative to selection: the 
ratio r/s (Shapiro and Polz 2014). If selection is overwhelmingly strong relative to 
recombination, the selected genome is expected to reach fixation before the advan- 
tageous alleles are transferred to other genotypes. Because gene sweeps have been 
more frequently observed than genome sweeps in bacterial species, it seems that the 
relatively modest levels of homologous recombination in bacteria—in comparison to 
truly sexual organisms—would suffice to prevent genome sweeps unless extremely 
beneficial alleles are introduced. 

Overall, the accumulation of empirical observations of gene sweeps in natural 
populations suggests that periodic selection might play a limited role in maintaining 
genomic cohesion in bacteria. Nevertheless, the SEM remains relevant for effec- 
tively clonal species (species with negligible rates of recombination), although the 
previously cited studies suggest that relatively few species might be effectively 
clonal (Vos and Didelot 2009; Bendall et al. 2016; Bobay and Ochman 2017a). 
An inherent limitation of the SEM and other ecology-based definitions, in general, is 
the difficulty of gaining accurate knowledge of microbial ecology and of identifying 
objective criteria to define distinct niches. This lack of ecological data 
appears even more dramatic when compared to the colossal accumulation of geno- 
mic data. In the (meta-)genomic era, alternative approaches are needed. Starting 
from this observation, several authors have suggested the use of a reverse ecology 
approach, where, instead of searching for the genetic variants responsible for 
ecological segregation, it is more relevant to search for the ecological factors 
associated with allelic or accessory gene segregation (Shapiro and Polz 2014). The 
development of a reverse ecology framework potentially offers a powerful tool to 
extend our comprehension of the ecological factors driving the evolutionary dynam- 
ics and the cohesion of bacterial species. 


Biological Species Concept Sexual organisms engage in meiotic recombination at 
each generation and this maintains the genetic cohesion of species (Mayr 1942). The 
mechanisms leading to speciation in sexual organisms are diverse, can be either pre- 
or post-zygotic in nature, and are often conceptualized in the context of spatial 
arrangement of populations (sympatric or allopatric) (Coyne and Orr 2004; De 
Queiroz 2007). Most models assume that prolonged interruption of gene flow 
(e.g., zero or few migrants per generation) between two separated populations can 
lead to the independent accumulation of new alleles and new traits in each popula- 
tion through drift or local adaptation, leading to build up of reproductive incompat- 
ibilities and potentially triggering reinforcement, if the two populations are reunited. 
Other mechanisms, such as the appearance of incompatible alleles or alleles resulting 
in mating preferences, or even genomic duplications or rearrangements, can also 
lead to sexual barriers and, therefore, to the interruption of gene flow between 
populations. While evolution of reproductive barriers is often associated with spe- 
ciation, it is important to realize that the interruption of gene flow can be either the 
cause or the consequence of speciation. In all scenarios, however, the interruption of 
significant gene flow remains associated with speciation, even if the barriers of gene 
flow can remain somewhat permissive after speciation (Mallet et al. 2007, 2016). 
Although bacteria do not engage in true sexual reproduction, it has long been 
known that they are capable of exchanging DNA (Smith et al. 1993). Because gene 
flow is a common phenomenon across plants and animals as well as bacteria, this 
opens the possibility to define bacterial species with the same standards of the 
biological species concept (BSC) (Dykhuizen and Green 1991; Fraser et al. 2009; 
Bobay and Ochman 2017a). The fact that bacteria have the capacity to exchange 
DNA does not necessarily imply that they form biological species; instead, the real 
challenge is to determine whether the strength of gene flow is sufficient to shape 
cohesive units in bacteria, and thus whether common speciation models 
based on gene flow are applicable to bacteria as well. The question is then: how 
much and how frequently do they recombine? Can we detect these patterns of gene 
flow in bacteria as we do for sexual organisms? By “gene flow”, I exclusively refer to 
the replacement of DNA sequences by homologous recombination (also referred to 
as gene conversion). Homologous recombination consists of the exchange between 
two sequences of DNA that typically display a high identity in nucleotide compo- 
sition (Vulic et al. 1997). In contrast to gene flow, horizontal gene transfer (HGT) 
refers to the gain of new genetic material without the replacement of a homologous 
sequence. This semantic differentiation allows for the distinction of gene segments 
of homologous genes that are exchanged (gene flow) versus new genes that are 
gained (HGT). Note that this distinction permits the differentiation of the outcome of 
the DNA transfer—homologous replacement or gain of DNA—but it does not 
necessarily involve different molecular mechanisms since HGT can involve homol- 
ogous recombination between regions flanking the exchanged sequence (Mell et al. 
2011; Croucher et al. 2012; Cordero et al. 2012; Everitt et al. 2014). 

Two independent studies have scrutinized a relatively large range of prokaryotic 
species and concluded that only a small proportion (<15%) of analyzed 
species do not show substantial signs of gene flow (Vos and Didelot 2009; Bobay 
and Ochman 2017a). In fact, similar numbers were estimated for viruses and there is 
growing evidence that the vast majority of cellular and acellular organisms engage in 
gene flow (Bobay and Ochman 2018a). In addition, many studies have reported that 
individual loci—rather than entire genotypes—sweep through natural populations 
(Simmons et al. 2008; Croucher et al. 2011; Shapiro et al. 2012; Cadillot-Quiroz 
et al. 2012; Bendall et al. 2016; Bao et al. 2016; Porter et al. 2017). These 
observations imply that gene flow is substantial enough to spread alleles—and 
even beneficial ones—to the entire population, suggesting the cohesive role of 
gene flow in bacterial genome dynamics. Importantly, the levels of gene flow across 
most bacterial species—and their variations—are often substantial enough to be 
detected using genomic datasets (Bobay and Ochman 2017a). Thanks to the vast 
accumulation of genomic data, it is possible to identify strains that do not engage in 
gene flow with the rest of the species (i.e., sexual isolation) by conducting 
large-scale resampling analyses. This allows sexual eukaryotes, bacteria, archaea, 
and even viruses to be classified under a unique BSC-based species definition. 

The delimitation of species based on gene flow is more cumbersome than ANI 
sequence thresholds, since it requires identification of the core-genome (or a portion 
thereof) for the tested genome sample and estimation of distances or tree topologies 
and potentially conducting resampling analyses (Bobay and Ochman 2017b). Sim- 
ilar to phylogenetic methods, it is also possible to compare individual genomes to a 
database of preprocessed species available online (i.e., ConSpeciFix) (Bobay et al. 
2018), which facilitates the classification of newly sequenced data. Detecting and 
quantifying gene flow remains a delicate endeavor as evidenced by the lack of a 
consensual methodology to infer homologous recombination. Various methods to 
estimate recombination rates exist, but they often rely on different models and 
assumptions regarding the recombination process (Didelot and Falush 2007; 
Marttinen et al. 2012; Yahara et al. 2014, 2015; Didelot and Wilson 2015; Mostowy 
et al. 2017), and this contributes to the inference of inconsistent estimates of 
recombination rates across studies (Bobay et al. 2015). Recently, we introduced a 
methodology based on the quantification of homoplasies to detect gene flow across 
large genomic datasets (Bobay and Ochman 2017a; Bobay et al. 2018). Homoplasies 
are polymorphisms incompatible with vertical inheritance from a shared ancestor 
and are mostly introduced by gene flow (Bobay and Ochman 2017a). Although the 
ratio between homoplasic and non-homoplasic polymorphisms does not provide an 
accurate metric to quantify recombination rates, the detection of homoplasies is 
rather straightforward and does not rely on complex model assumptions and 
over-parameterization. Interestingly, this homoplasy-based approach appears more robust 
to genome resampling and gene bootstrapping when compared to ClonalFrameML 
(Bobay and Ochman 2018b). Inferring gene flow based on homoplasies is limited to 
the detection of recombination events internal to the dataset and the method does not 
aim to model imports from external sources. Recombining species can sometimes be 
misclassified as clonal when multiple sexually isolated genomes are included in the 
analysis and the sample size is too small to resample and test subpopulations for gene 
flow; thus, the method is most efficient when large datasets are available and when 
genetic diversity is high. This limitation will be resolved as more genomes are 
sequenced, but, to date, the analysis of several species can remain inconclusive 
due to ambiguous signals (Bobay and Ochman 2017a). In addition, the recent 
accumulation of metagenomic data combined with the development of bioinformat- 
ics tools that resolve strain genotypes within metagenomic samples (Nayfach et al. 
2016; Pasolli et al. 2017; Truong et al. 2017) constitutes a new source of data readily 
exploitable to define species based on gene flow. 
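
As an illustration of how homoplasies can be recognized, the following minimal Python sketch counts homoplasic and non-homoplasic polymorphic sites on a fixed, rooted binary tree using Fitch parsimony. It is not the implementation used by ConSpeciFix or in the cited studies: the tree encoding (nested tuples), the function names, and the toy alignment are all assumptions made for this example. A polymorphic site is flagged as homoplasic when it requires more changes on the tree than its number of alleles minus one.

```python
from typing import Dict, Set, Tuple, Union

# Hypothetical tree encoding: a leaf is a strain name, an internal node is a pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_changes(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Fitch parsimony for one site: candidate ancestral states and minimum changes."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_changes(tree[0], states)
    right_set, right_cost = fitch_changes(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Both subtrees can agree on a state: no extra change at this node.
        return shared, left_cost + right_cost
    # Subtrees disagree: one additional change is required.
    return left_set | right_set, left_cost + right_cost + 1


def is_homoplasic(tree: Tree, states: Dict[str, str]) -> bool:
    """A site is homoplasic if it needs more changes than (number of alleles - 1)."""
    n_alleles = len(set(states.values()))
    _, changes = fitch_changes(tree, states)
    return n_alleles > 1 and changes > n_alleles - 1


def homoplasy_counts(tree: Tree, alignment: Dict[str, str]) -> Tuple[int, int]:
    """Return (homoplasic, non-homoplasic) counts over polymorphic sites.

    Assumes an ungapped alignment with sequences of equal length.
    """
    length = len(next(iter(alignment.values())))
    homoplasic = non_homoplasic = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        if len(set(column.values())) < 2:
            continue  # monomorphic site, not informative
        if is_homoplasic(tree, column):
            homoplasic += 1
        else:
            non_homoplasic += 1
    return homoplasic, non_homoplasic


# Toy data (hypothetical): four strains and five aligned core-genome positions.
tree = (("A", "B"), ("C", "D"))
alignment = {"A": "ACGTA", "B": "ACGTC", "C": "ATGAA", "D": "ATGAC"}
print(homoplasy_counts(tree, alignment))  # (1, 2)
```

In this toy alignment, the last site groups A with C and B with D, which conflicts with the tree and is counted as a homoplasy. A real analysis would also need to handle gaps, missing data, and uncertainty in the tree itself.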

Because bacteria can sometimes gain genes from other species through HGT, it 
has been argued that bacteria might not fit a BSC definition in comparison to truly 
sexual organisms. Species borders are somewhat “fuzzy” for bacteria (Hanage et al. 
2005; Hanage 2013) and many studies have detected HGT events in prokaryotes, 
leading to the conclusion that they might be genomically promiscuous (Popa and 
Dagan 2011). It should be emphasized, however, that gene flow between species 
remains very rare when considering the overall time scale of prokaryote evolution, 
and HGT events occur primarily between related bacteria (Popa et al. 2011). In 
contrast, gene flow within species is expected to occur at much higher frequencies 
relative to the acquisition of new genes from external species by HGT (Caro-Quintero 
et al. 2009; Cadillot-Quiroz et al. 2012; Shapiro et al. 2012; Krause and Whitaker 
2015; David et al. 2017). Comparison of ~100 species indicates that most bacteria 
show clear signs of gene flow and the same method can also retrieve species borders 
in well classified animals such as humans and Drosophila (Bobay and Ochman 
2017a). It is well established that sexual eukaryotes are not as well isolated as 
previously thought (Danchin and Rosso 2012; Syvanen 2012), but introgression 
and incomplete lineage sorting do not typically prevent defining species borders in 
truly sexual organisms (Mallet et al. 2016). Although eukaryotic and prokaryotic 
species borders can be “leaky” and occasionally allow gene flow from external 
sources, this process need not be prevalent enough to blur species borders (Mallet 
2008). 

Given the commonality of genomic exchange across diverse types of organisms, a 
BSC-based definition allows the use of a universal species concept to classify all 
lifeforms under a biologically relevant definition. What are the implications of 
applying such a species concept to microbes? Most BSC-species (i.e., bacterial 
species classified based on the BSC) correspond to closely related genomes that 
typically present >95% ANI (Bobay and Ochman 2017a). However, this is not 
always true since several BSC-species contain genomes that would not be classified 
as members of the same species based on ANI thresholds and, conversely, other 
BSC-species were found to exclude members that would be part of the same species 
according to ANI thresholds (>95% ANI) (Bobay and Ochman 2017a). These results 
are in agreement with analyses showing that a single ANI or phylogenetic threshold 
fails to define consistent species across prokaryotes (Parks et al. 2018; Wright and 
Baum 2018). These differences can be putatively ascribed to the use of more-or-less 
permissive recombination mechanisms across species. Experimental data have 
suggested that the frequency of homologous recombination decreases exponentially 
with sequence divergence (Roberts and Cohan 1993; Zawadzki et al. 1995; Vulic 
et al. 1997; Majewski and Cohan 1998; Majewski et al. 2000) due to the action of the 
mismatch repair system (Matic et al. 2000). These observations suggest a simple 
model of sexual isolation in bacteria. The action of the mismatch repair system seems 
highly variable across taxa (Majewski 2001), which suggests that barriers of gene 
flow driven by sequence divergence would also be variable across species. In contrast 
to these observations, there is no systematic negative correlation between 
recombination and sequence divergence (Bobay and Ochman 2017a) and gene flow has been 
reported between bacteria presenting relatively divergent genomes (Sheppard et al. 
2008; Mell et al. 2011; Cordero et al. 2012), suggesting that sequence divergence 
plays a limited role in establishing barriers of gene flow. These discrepancies between 
experimental data and genome analyses can be explained by multiple factors. Firstly, 
gene flow is detected by the exchange of polymorphisms, and recombination events 
that do not result in any exchange of polymorphisms can remain invisible to some 
approaches. This implies that the rates of recombination between highly similar 
genomes are frequently underestimated. Secondly, selection can potentially have a 
strong impact in selecting—positively or negatively—alleles exchanged by gene 
flow, mirroring adaptive introgression or Dobzhansky-Muller incompatibilities in 
sexual organisms (Mallet et al. 2016). Finally, a simpler explanation might account 
for these discrepancies. The exponential relationship between sequence identity and 
recombination rate is based on the observation that nearly identical regions flanking 
the recombination tract, the minimum efficiently processed segments (MEPS), are 
needed to initiate recombination (Shen and Huang 1986; Wiedenbeck and Cohan 
2011; Hanage 2016). However, sequence identity need not be high along the entire 
segment of recombined DNA because recombination requires high sequence identity 
only along the MEPS, which are only ~26 nt long (Shen and Huang 1986; 
Wiedenbeck and Cohan 2011; Hanage 2016). This suggests that more variable 
sequences of DNA might be exchanged as long as a few clusters of nearly identical 
nucleotides remain available to initiate homologous recombination. 
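
The argument above can be made concrete with a short Python sketch that scans two aligned sequences for stretches of perfect identity long enough to serve as MEPS. This is only an illustration under simplifying assumptions: the sequences are taken to be ungapped and of equal length, the 26-nt default follows the approximate MEPS length quoted in the text, and the function names are hypothetical. The second function gives the chance that a window of that length is free of mismatches if differences were spread independently and uniformly along the sequence, which is an assumption and not a claim about real genomes.

```python
from typing import List, Tuple


def identical_runs(seq_a: str, seq_b: str, min_len: int = 26) -> List[Tuple[int, int]]:
    """Return (start, length) for every run of identical positions >= min_len."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    runs = []
    start = None  # start index of the current run of identical positions
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a == b:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and len(seq_a) - start >= min_len:
        runs.append((start, len(seq_a) - start))
    return runs


def prob_mismatch_free_window(divergence: float, length: int = 26) -> float:
    """Probability that `length` consecutive positions are all identical,
    assuming independent, uniformly distributed differences."""
    if not 0.0 <= divergence <= 1.0:
        raise ValueError("divergence must be a fraction between 0 and 1")
    return (1.0 - divergence) ** length


# Toy example (hypothetical sequences): two identical flanks around a variable core.
donor = "A" * 30 + "C" + "G" * 10 + "T" + "A" * 26
recipient = "A" * 30 + "T" + "G" * 10 + "C" + "A" * 26
print(identical_runs(donor, recipient))  # [(0, 30), (42, 26)]
print(round(prob_mismatch_free_window(0.05), 2))  # 0.26
```

Under this simplified view, even at 5% divergence roughly a quarter of 26-nt windows would remain mismatch-free, which is consistent with the idea that divergent segments can still be flanked by usable MEPS.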


Mixed Model The SEM and a BSC-like model of bacterial evolution need not be 
fundamentally opposed. A BSC-like model is, by definition, unable to define species 
borders for clonal species. It is also likely that species with low rates of recombina- 
tion would appear effectively clonal when analyzing genomic data, meaning that the 
BSC will fail to accurately delimit species in some bacterial groups. For these clades, 
periodic selection, as described by the SEM, appears to be the most pertinent force 
maintaining genetic cohesion, and the SEM is therefore the most appropriate model 
to define the borders of these species. The fact that very few 
studies have reported genome sweeps relative to gene sweeps suggests the preva- 
lence and significance of recombination in bacteria and implies that the vast majority 
of bacterial species can be defined based on the BSC. Both models could, therefore, 
be integrated to define species; the SEM for lineages that are effectively clonal and a 
BSC-like model for species that appear effectively sexual. A key distinction between 
both models is that the SEM is inherently ecologically centered, whereas a 
BSC-based model of bacterial evolution does not necessarily involve ecological 
mechanisms. However, the speciation processes through new niche colonization 
assumed under the SEM can also lead to speciation under the BSC. 


3 Speciation: From Maintenance to Disruption of Genomic 
Cohesion 


Neutral Processes Simulations have provided insightful answers regarding the 
impact of neutral evolution on the formation of new species. In the absence of 
recombination, it is expected that some distinct genome clusters would emerge in 
sympatry (Fraser et al. 2007). However, most of these newly emerged clusters are 
expected to go extinct through drift. On the other hand, gene flow allows populations 
to maintain cohesive genomes (Fraser et al. 2007; Friedman et al. 2013). These 
results suggest that neutral evolution is unlikely to promote the emergence of new 
species in bacteria, especially in the case of recombining populations. It has been 
noted that this neutral model of speciation does not consider the potential barrier of 
gene flow imposed by sequence divergence (Fraser et al. 2007), in which case, it may 
be possible that divergent genome clusters become more and more sexually isolated. 
It should be underlined, however, that neutral evolution is expected to drive diver- 
gence very slowly, and due to the frequent loss of newly emerged clusters by drift, it 
is unlikely that population clusters would accumulate enough mutations to impose a 
substantial barrier of gene flow. 


Geography The previous model of neutral speciation has been developed for 
sympatric populations (i.e., geographically overlapping populations), which is 
thought to be the preponderant situation in bacteria (Vos 2011; Shapiro and Polz 
2015). However, geographic differentiation suggests that allopatric speciation could 
occur in bacteria (Simmons et al. 2008; Denef et al. 2010; Whitaker et al. 2003; Reno 
et al. 2009; Krause and Whitaker 2015). Processes resembling allopatric speciation 
with the interruption of gene flow in bacteriophages targeting different receptors 
have even been observed in an experimental evolution setting (Meyer et al. 2016). 
The impact of geography remains elusive since species spanning large continental 
and oceanic distributions can remain genetically cohesive (Papke et al. 2007; 
Coleman and Chisholm 2010; Boucher et al. 2011). Recent modeling work has 
emphasized the impact of niche overlap in bacterial speciation, further revealing the 
importance of habitat structure in promoting genomic isolation, especially for 
recombining bacteria (Marttinen and Hanage 2017). The spatial dynamics of micro- 
bial distributions remains difficult to characterize and seemingly overlapping 
populations might not necessarily encounter each other due to fine-scale habitat 
structure (i.e., mosaic sympatry) (Mallet 2008; Shapiro and Polz 2014). 


Recombination Barriers As mentioned above, the initiation of homologous recom- 
bination requires the presence of nearly identical short sequences (i.e., MEPS) (Vulic 
et al. 1997; Majewski and Cohan 1999) and, although relatively divergent sequences 
can engage in gene flow, sequence divergence can affect recombination rates due to 
the frequency of available MEPS to initiate recombination. Interestingly, the 
sequence (MEPS) conservation required to initiate recombination seems to be depen- 
dent on the mismatch repair (MMR) system (Matic et al. 2000), which can be more or 
less permissive across species and strains. The evolution—and sometimes the com- 
plete loss—of the MMR system is therefore expected to have a strong impact on 
sexual isolation in prokaryotes. 

Restriction-Modification (RM) systems are frequently used by bacteria to protect 
themselves against mobile elements and, in particular, bacteriophages (Thomas and 
Nielsen 2005; Labrie et al. 2010). The presence of different RM systems across 
strains and species can lead to incompatibilities of gene flow and this has been found 
to regulate and structure gene flow (Oliveira et al. 2014, 2016). Consequently, the 
gain or loss of RM systems can have direct consequences on the interruption of gene 
flow and can potentially lead to speciation. In theory, CRISPR-Cas systems might 
exhibit similar properties, but since they specifically target a limited number of 
sequences, they are unlikely to introduce genome-wide incompatibilities. Because of 
these properties, RM systems can shape the networks of gene flow and the popula- 
tion structure of bacterial species. These systems might drive the establishment of 
durable barriers of gene flow, potentially leading to speciation. 

Gene flow relies on the presence of different vectors and mechanisms capable of 
disseminating and capturing DNA. The three main mechanisms of DNA transfer, 
namely transformation, conjugation, and transduction, present diverse degrees of 
specificity. (i) Transformation does not require cell-to-cell interactions, since envi- 
ronmental DNA is directly taken up by the cell; but recipient cells need to be 
competent, and relatively few bacteria are known to naturally engage in this process 
(Johnston et al. 2014). Some bacteria engaging in transformation such as Neisseria 
and Pasteurellaceae require the presence of specific DNA uptake sequences or 
uptake signal sequences (Goodman and Scocca 1988; Scocca et al. 1974; Danner 
et al. 1982), thereby restricting the range of potential DNA donors to related lineages. 
Moreover, due to the rapid degradation of DNA when released in the environment, 
this mechanism likely requires close proximity between cells, suggesting that trans- 
formation might only mediate gene flow between sympatric populations. 
(ii) Conjugation involves more constrained transfers of DNA through cell-to-cell 
contacts, which is mediated by specific pilus interactions and type IV secretion 
systems (Guglielmini et al. 2013). These conjugative transfers occur primarily 
between conspecifics, although plasmids have been shown to be occasionally 
exchanged across much more divergent lineages (Smillie et al. 2010). Because this 
process requires the direct contact of cells, gene flow mediated by the conjugative 
apparatus must also occur in sympatry. (iii) Transduction is another route for gene 
flow where bacterial DNA is packaged within phage particles or gene transfer agents 
(GTAs) (Lang and Beatty 2007; Popa and Dagan 2011). Phage particles are rarely 
able to infect multiple species and are often restricted to a subset of strains (Popa et al. 
2017). As opposed to transformation and conjugation, phage particles can potentially 
transport DNA over longer distances (and potentially for long periods of time), 
suggesting that allopatric—and perhaps anachronistic—populations are able to 
engage in some levels of gene flow without requiring migration. These three mech- 
anisms, and especially conjugation and transduction, rely on specific molecular 
signals and are typically restricted to conspecific cells. The overall specificity of 
these mechanisms is expected to favor gene flow within species rather than between 
species. Conjugation and transduction also potentially have important consequences 
for bacterial speciation, since the loss of cell-vector specificity can lead to the partial 
or complete interruption of gene flow. 


Selection As mentioned above, neutral processes are unlikely to lead to bacterial 
speciation, especially in the case of sympatric recombining populations that co-occur 
at fine spatial scales (Fraser et al. 2007). This suggests that selection must initiate the 
formation of distinct genomic clusters, which might eventually lead to selection 
against genetic intermediates and the cessation of gene flow (Shapiro 2014). Eco- 
logical specialization is thought to be a strong force leading to speciation, since the 
nascent species will present differentially selected EcoSNPs or specialized accessory 
genes, i.e., alleles or genes specialized in one niche (Shapiro et al. 2012). Simula- 
tions have shown that sympatric speciation is more likely when fewer loci are 
required for speciation and when recombination is reduced (Friedman et al. 2013). 
As two populations become more and more differentiated, the accumulation of 
substitutions is expected to reduce gene flow due to epistatic interference (Jain 
et al. 1999), similarly to Dobzhansky-Muller incompatibilities. Indeed, many loci 
of the genome coevolve together, and, for instance, central protein complexes such 
as translation, transcription, and replication complexes require interaction between 
many central proteins that coevolved together, which could explain why these genes 
are rarely exchanged by HGT across species, i.e., the “complexity hypothesis” (Jain 
et al. 1999). Such incompatibilities are expected to be most relevant when 
populations have significantly diverged and most likely form barriers of gene flow 
when DNA originates from distant species. However, it is possible that those 
negatively selected epistatic interactions also contribute to the isolation of more 
recently diverged populations. 

Several studies have demonstrated that the impact of selection on bacterial 
genome evolution depends on the relative prevalence of selection (s) and recombi- 
nation rate (r) in sympatric evolution (Shapiro et al. 2009; Friedman et al. 2013; Polz 
et al. 2013). When selection is much stronger than recombination (r/s << 1), the 
selected allele will lead to the fixation of the entire genotype through genome sweep. 
The resulting process will be similar to the periodic selection predicted by the SEM. 
On the other hand, alleles with lower selective coefficients relative to recombination 
(r/s >> 1) are expected to evolve by gene/allele sweep. In this case, selection will be 
unable to lead to speciation as the selected allele will be exchanged between the 
population’s genotypes by gene sweep. Several studies have attempted to determine 
whether prokaryotic populations evolve primarily through gene or genome sweeps 
and, so far, evidence overwhelmingly suggests that gene sweeps are more frequent 
than genome sweeps (a single case of genome sweep against ~35 cases of gene 
sweeps (Simmons et al. 2008; Croucher et al. 2011; Shapiro et al. 2012; Cadillot- 
Quiroz et al. 2012; Bendall et al. 2016; Bao et al. 2016; Porter et al. 2017)). The large 
prevalence of gene sweeps over genome sweeps is somewhat surprising considering 
that prokaryotes, as asexual organisms, are thought to display modest rates of gene 
flow (Wiedenbeck and Cohan 2011). It is, however, difficult to clearly quantify the 
impact of gene flow on genome evolution (Bobay et al. 2015) and a recent exper- 
imental evolution study has shown that gene flow can even lead to the extinction of 
beneficial alleles (Maddamsetti and Lenski 2018). It is possible that additional 
factors counteract genome sweeps, such as clonal interference (Lieberman et al. 
2014; Maddamsetti et al. 2015) and negative frequency-dependent selection 
(Cordero and Polz 2014; Takeuchi et al. 2015). 
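
The two regimes described above can be summarized in a few lines of Python. This is a deliberately simple sketch: the text only states the conditions r/s << 1 and r/s >> 1, so the numerical `margin` used here to decide what counts as "much smaller" or "much larger" than 1 is an arbitrary assumption, as are the function name and the example values.

```python
def expected_sweep(recombination_rate: float,
                   selection_coefficient: float,
                   margin: float = 10.0) -> str:
    """Classify the expected outcome of a beneficial allele from the ratio r/s.

    `margin` is a hypothetical cutoff: r/s <= 1/margin is treated as r/s << 1
    and r/s >= margin as r/s >> 1. Values in between are left unresolved.
    """
    if selection_coefficient <= 0:
        raise ValueError("selection coefficient must be positive")
    if recombination_rate < 0:
        raise ValueError("recombination rate cannot be negative")
    ratio = recombination_rate / selection_coefficient
    if ratio * margin <= 1.0:
        # Selection dominates: the whole genotype hitchhikes to fixation.
        return "genome sweep"
    if ratio >= margin:
        # Recombination dominates: the allele spreads across genotypes.
        return "gene sweep"
    return "intermediate"


# Illustrative values only (not estimates from any study).
print(expected_sweep(recombination_rate=1e-6, selection_coefficient=1e-2))  # genome sweep
print(expected_sweep(recombination_rate=1e-3, selection_coefficient=1e-5))  # gene sweep
```

The "intermediate" outcome reflects the fact that no sharp boundary separates the two regimes, and that other factors mentioned in the text, such as clonal interference or negative frequency-dependent selection, are not captured by the ratio alone.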


Introgression and HGT from External Species In comparison to the processes 
acting in sexual organisms, occasional gene flow from external bacteria could be seen 
as a form of introgression. It has been noted that introgression can sometimes present 
a source of adaptive alleles in sexual organisms and those transfers can even lead to 
hybrid speciation (Mallet 2007; Rieseberg 1997; Seehausen 2004; Keller et al. 2013). 
The importance of these processes remains to be explored in prokaryotes. A study 
comparing the evolution of two Campylobacter species—C. jejuni and C. coli—can 
be viewed as evidence of bacterial introgression (Sheppard et al. 2008, 2013). 
Although these results might lead to the complete "despeciation" of the two lineages, 
it should be noted that the transfer of DNA is asymmetric: one clade of C. coli 
has likely gained alleles from C. jejuni, whereas other clades of C. coli have not. Interestingly, 
this case of bacterial introgression appears ecologically-driven based on recent 
niche overlap (Sheppard et al. 2008). It is, therefore, possible that introgression can 
result in the same outcomes in prokaryotes, such as hybrid speciation (Shapiro et al. 
2016). 

Similar to introgression, the gain of new genes from distinct species by HGT 
offers another means to colonize new niches through ecologically-driven adaptation. 
The acquisition of antibiotic resistance genes constitutes a well-documented case, but 
many other examples have been reported (Ochman et al. 2000; Popa and Dagan 
2011). It has been shown that HGT—rather than duplication—plays a predominant 
role in introducing new paralogs in the pangenome of prokaryotic species (Treangen 
and Rocha 2011), although these genes frequently come from related species due to 
genetic incompatibilities (i.e., gene promoters/regulators and codon usage bias) 
(Sorek et al. 2007; Popa et al. 2017). These acquired genes can mediate the 
colonization of new niches and can potentially lead to ecology-driven speciation. 
However, as noted above, accessory genes are not stably associated with a given 
genotype and tend to be frequently exchanged across strains of a given species 
(Schubert et al. 2009), indicating that they do not necessarily drive the formation of 
distinct ecologically specialized entities (Shapiro and Polz 2015). 


Summary Across the many forces that can affect speciation, it should be noted that 
neutral processes such as population dynamics and sequence divergence are unlikely 
to lead to speciation in bacteria, and that selection seems to be a necessary force for 
initiating and maintaining speciation. Selection in bacteria can act through two 
predominant avenues: (i) by driving ecological adaptation to different niches fol- 
lowing, for instance, the gain of new genetic material and (ii) by preventing gene 
flow between populations due to the presence of genetic incompatibilities, such as 
different RM systems, vector specificity, or negative epistasis. Other factors such as 
population dynamics and geographic range have been found to have an impact on 
speciation, although their relative contribution remains to be precisely deciphered. 
Overall, a BSC-based speciation model in prokaryotes would also rely on ecological 
processes and selection, as hypothesized by the SEM. However, one major differ- 
ence with the SEM is that a BSC-based model of prokaryotic speciation predicts that 
speciation events can be driven by genetic incompatibilities and need not be sys- 
tematically adaptive and ecologically-driven. 


4 Species Borders and Pangenome Borders 


Pangenome and Species Definitions The definition of species has direct conse- 
quences regarding the definition of pangenomes. If bacterial species are defined 
based on inconsistent criteria, it is not possible to compare the size of the pangenome 
across species and lineages. The case of Prochlorococcus illustrates this issue 
particularly well. Prochlorococcus is often studied as a single entity since it consti- 
tutes a single species based on 16S rRNA thresholds but multiple species based on 
ANI thresholds. The pangenome of Prochlorococcus has been estimated to reach the 
impressive amount of ~75,000 genes (Kashtan et al. 2014), although this would 
include strains that present less than 70% ANI, and this entity would actually 
correspond to multiple species and even genera. This issue likely affects many 
pangenome analyses considering that public databases frequently contain 
misclassified species and species classified based on inconsistent methods (Martiny 
et al. 2006; Comas et al. 2009; Trost et al. 2010). Studies focusing on the evolution 
of bacterial pangenomes should be based on rigorous species delimitation, since the 
misclassification of a single genome can lead to dramatic overestimates or underes- 
timates of the size of a species’ pangenome. 

Species delimitation is not the only concern when analyzing pangenomes. The 
number of genomes sampled for each species obviously impacts pangenome esti- 
mates, since pangenomes necessarily increase in size as more genomes are included. 
It is possible to test if pangenome size reaches a plateau by performing resampling 
analyses, which would indicate that a sufficient number of genomes have been 
sampled to estimate the true pangenome size of the analyzed species (Tettelin 
et al. 2005; Lapierre and Gogarten 2009). Alternatively, it is possible to apply 
resampling analyses or to correct these metrics to account for uneven sampling 
biases across species (Bobay and Ochman 2018b). Biases in species sampling are a 
common issue for many genomic analyses and several methods have been developed 
as an attempt to address this shortcoming (Lapierre et al. 2016). However, the most 
efficient solution remains to increase sample sizes, and, more importantly, to limit 
biases when collecting samples, but this last consideration is often in conflict with 
study designs focusing on medically- or environmentally-relevant strains. 
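
The resampling idea described above can be sketched as a permutation-based accumulation curve. This is a simplified illustration, not the procedure of the cited studies: it assumes each genome is already reduced to a set of gene-cluster identifiers, and the function names and toy data are hypothetical.

```python
import random


def accumulation_curve(genomes, n_permutations=100, seed=0):
    """Mean pangenome size after adding 1..N genomes in random order.

    genomes: list of sets, each holding the gene-cluster IDs of one genome.
    """
    rng = random.Random(seed)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        order = genomes[:]
        rng.shuffle(order)
        seen = set()
        for i, genes in enumerate(order):
            seen |= genes              # union of all clusters seen so far
            totals[i] += len(seen)
    return [t / n_permutations for t in totals]


def new_genes_at_last_step(curve):
    """Average number of new clusters contributed by the last genome added."""
    return curve[-1] - curve[-2]


toy_genomes = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c", "e"}]
curve = accumulation_curve(toy_genomes, n_permutations=50, seed=1)
print(curve, new_genes_at_last_step(curve))
```

If the value returned by `new_genes_at_last_step` approaches zero as genomes are added, the curve is approaching a plateau; a value that stays clearly positive suggests that sampling is still insufficient.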


Cohesion of Core- and Pangenomes The goal of a species definition is to identify 
cohesive ensembles of evolutionary lineages. The ideal species definition would 
succeed in identifying genetically and ecologically cohesive units. Although genetic 
cohesion is easier to assess than ecological cohesion for bacteria, the genetic 
homogeneity of a group of organisms can be evaluated through different lenses. 
Firstly, because the core-genome constitutes the backbone of genes shared by all 
members of the species, these genes are more readily used to infer evolutionary 
relatedness and other metrics. Moreover, despite gene flow, core-genomes have 
conserved the phylogenetic signal of the vertical inheritance of bacterial taxa 
(Touchon et al. 2009; Abby et al. 2012). Nearly all genome-based species defini- 
tions—i.e., ANI, phylogenetic methods, and BSC-like—rely exclusively on the 
cohesion of the core-genome. The pangenome potentially offers an alternative 
measure of the genetic cohesion of species, since conspecific strains are expected 
to share more similar gene repertoires than strains belonging to distinct species. It is 
currently difficult to assess the pangenome cohesion of a species considering that 
accessory genes tend to be found at low frequency within species and this would 
require deep genome sampling, although more and more bacterial species now have 
hundreds or thousands of sequenced genomes. More analyses need to be performed 
to understand the specificity of pangenomes, especially in relation to closely related 
lineages and ecologically or geographically overlapping species. 
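
One simple way to quantify how similar the gene repertoires of strains are is the Jaccard index between their sets of gene clusters. The sketch below is only an illustration of that idea under the assumption that each strain is represented as a set of cluster identifiers; it is not the measure used in any of the studies cited here, and the names and toy data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared gene clusters divided by all gene clusters found in either strain."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(repertoires):
    """Average Jaccard similarity over all pairs of strains."""
    pairs = list(combinations(repertoires, 2))
    if not pairs:
        raise ValueError("at least two gene repertoires are required")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


toy_species = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b"}]
print(mean_pairwise_jaccard(toy_species))
```

Comparing the mean similarity within a candidate species to the similarity between strains of different candidate species would give a rough indication of pangenome cohesion, subject to the sampling caveats discussed above.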

Gene flow can define biological species based on DNA exchange along the core- 
genome but, so far, this approach has ignored the patterns of HGT of the 
pangenome. The core- and pangenomes are two complementary metrics that can 
be used to infer the cohesion of species and some recent results obtained in two 
bacterial phyla suggest that core- and pangenomes present the same phylogenetic 
signal, implying that both can be reliable for inferring species borders (Wright and 
Baum 2018). In fact, a recent study has made a first attempt to delimit 
species based on pangenome cohesion (Moldovan and Gelfand 2018), which opens 
promising possibilities to include pangenome cohesion into species delimitation. 
More work needs to be done in order to finely understand the evolutionary dynamics 
of the pangenome itself. For instance, the dynamics of the pangenome is likely 
affected by the ability of a given species to engage in gene flow, as suggested by a 
study showing that clonal species are unlikely to present a large pangenome, since 
their pangenome primarily evolves through gene loss (Bolotin and Hershberg 2015). 
Bacterial species can also gain new genes from external lineages and the extent of 
segregation of the pangenome remains poorly understood. The accumulation of 
genomic data should soon allow more accurate analysis of the dynamics of the 
pangenome and this will open new avenues for evaluating the genetic cohesion of 
prokaryotic species. 


5 Drift-Barrier Model for Pangenome Evolution 


A BSC-based species definition is particularly relevant for studying population 
genetics in prokaryotic organisms. Several parameters such as recombination rate, 
effective population size (Ne), or pangenome size are metrics that are typically 
inferred at the species level. In particular, Ne has strong implications regarding the 
relative impact of selection and drift acting on a given species. High Ne populations 
are less sensitive to drift and can efficiently purge deleterious sequences, whereas 
low Ne populations, on the other hand, will not be as effective at purging deleterious 
mutations. A trait conferred by a given variant would primarily evolve through drift 
(i.e., neutrally) when |2.Ne.s| << 1, while selection will be effective when 
|2.Ne.s| >> 1, where s represents the selection coefficient of a given sequence or variant 
(Kimura 1968). For these reasons, it is believed that more complex organisms such 
as mammals, which have low Ne, present larger genomes due to the accumulation of 
"junk DNA" through drift (i.e., the Mutational Hazard Hypothesis) (Lynch and 
Conery 2003; Lynch et al. 2011). Because these organisms display small population 
sizes, selection is not as efficient at purging slightly deleterious sequences, such as 
noncoding DNA, introns, and mobile elements. 

In contrast to many eukaryotes, bacterial genomes are small and compact and 
because microbes present much larger population sizes, this seems in perfect agree- 
ment with the expectation of the Mutational Hazard hypothesis. The genomic 
compactness of bacteria has been ascribed to a strong bias toward deletion in these 
organisms (Mira et al. 2001; Andersson and Andersson 2001). However, several 
studies have observed that, across bacteria, genome size appears positively corre- 
lated with Ne (Daubin and Moran 2004; Kuo et al. 2009; Novichkov et al. 2009). 
Free-living bacteria frequently possess relatively large genomes (typically >3 Mb), 
while obligate endosymbionts, which have low Ne, have smaller genomes (frequently 
<1 Mb) (Moran and Plague 2004). Yet, some marine bacteria, which are thought to 
reach gigantic population sizes, also present streamlined genomes (Giovannoni et al. 
2005, 2014). In particular, Prochlorococcus and Pelagibacter ubique have small 
genomes (~1 Mb), although they might be among the most abundant cellular 
organisms on earth (Batut et al. 2014). Therefore, the relationship between Ne and 
genome size appears to be more complex in bacteria. 

One key difference between bacteria and higher eukaryotes is the very low 
amount of noncoding DNA, introns and mobile elements found in most bacterial 
genomes. In prokaryotes, variations in genome size are primarily driven by the 
presence of different amounts of accessory genes. Accessory genes are assumed to 
be functional and beneficial to the cell and recent modelling work suggests that 
virtually all genes in prokaryotic genomes are expected to be beneficial (Sela et al. 
2016). Because the diversity of accessory genes is a direct function of pangenome 
size, this opens the possibility that Ne may drive the evolution of pangenome size 
rather than average genome size in prokaryotes. In support of this hypothesis, clear 
correlations between Ne and pangenome size have been observed across a dataset of 
153 species, whose borders have been defined based on the BSC under a unified 
framework (Bobay and Ochman 2018b). Other recent studies have also reported 
similar trends (McInerney et al. 2017; Andreani et al. 2017). 

[Figure 1; label in figure: Selection coefficient s] 

Fig. 1 Drift-Barrier model of pangenome evolution. Each large circle represents a pangenome and 
small circles represent individual genes. Color gradient reflects the selective coefficient of the genes. 
Species with large effective population size Ne are less subject to drift and can retain genes of small 
beneficial value (left). As Ne decreases, additional genes of small fitness benefit will be perceived as 
effectively neutral and will be lost by drift (center). Under strong levels of drift, as expected in small 
Ne species, only the most beneficial genes will be conserved by selection, and this will result in 
small pangenomes mostly composed of core/housekeeping genes (right) 

Based on these observations, we have recently proposed that bacterial 
pangenomes could be driven by Drift-Barrier evolution (Bobay and Ochman 
2018b). The Drift-Barrier model was originally developed to account for the 
variation in mutation rates across organisms (Sung et al. 2012; Lynch et al. 2016). 
Under a Drift-Barrier model, pangenome size is expected to be a function of Ne 
because only the most beneficial accessory genes would be conserved by selection in 
small Ne species, while species with large Ne would be able to conserve accessory 
genes with modest fitness contribution (Fig. 1). As supported by multiple studies, 
deleterious and neutral sequences are expected to be quickly purged from microbial 
genomes (Mira et al. 2001; Andersson and Andersson 2001). Our model assumes 
that virtually every gene of the pangenome is beneficial (positive selection coeffi- 
cient: s > 0). Even if beneficial, an accessory gene is expected to be retained by 
selection only if it is perceived as effectively beneficial. In other words, an accessory 
gene will be conserved when 2.Ne.s >> 1, while genes that appear effectively 
neutral (2.Ne.s << 1) are expected to be lost by drift. This implies that high Ne 
species are expected to retain a larger pool of genes including many accessory genes 
with modest fitness contribution, whereas low Ne species can only conserve the most 
beneficial genes (high s), i.e., mostly essential and/or core genes. Although new 
genes can be introduced into a species' pangenome by HGT, those accessory genes 
with low selective coefficient will be lost by drift. 
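
The retention rule of the Drift-Barrier model can be made concrete with a small numerical sketch. The hard cutoff of 1 on 2.Ne.s is a simplifying assumption (the model is stated in terms of ">> 1" and "<< 1"), and the selection coefficients and population sizes below are arbitrary toy values, not estimates from any dataset.

```python
def retained_genes(selection_coeffs, ne, threshold=1.0):
    """Return the selection coefficients of genes that selection can retain.

    A gene counts as effectively beneficial when 2 * Ne * s exceeds
    `threshold`; the value 1.0 is an assumed stand-in for ">> 1".
    """
    return [s for s in selection_coeffs if 2 * ne * s > threshold]


# Toy accessory genes, all beneficial (s > 0) but of decreasing benefit.
toy_coeffs = [1e-2, 1e-4, 1e-6, 1e-8]

for ne in (1e3, 1e6, 1e9):
    kept = retained_genes(toy_coeffs, ne)
    print(f"Ne = {ne:.0e}: {len(kept)} of {len(toy_coeffs)} genes retained")
```

With the same pool of incoming genes, the number retained grows with Ne, which is the qualitative prediction the model makes for pangenome size.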


6 Outlook 


Many aspects of bacterial biology are now better understood but building a 
biologically-relevant microbial species concept remains challenging. Because pro- 
karyotic organisms are microscopic, their population dynamics, ecological interac- 
tions, and speciation mechanisms are still difficult to decipher. Many aspects of the 
population processes driving microbial evolution have not been characterized. 
The habitat structure of prokaryotic species, and its temporal variation, is still 
largely unknown. Similarly, microbial ecology and its impact on population 
dynamics remain tedious to describe in depth. Defining clear microbial niches is 
problematic practically and conceptually and little is known about microbial ecology 
compared to the vast collection of genomic data now available. The recent devel- 
opment of reverse ecology approaches opens a new route to gain knowledge about 
microbial ecology. 

The accumulation of genomic data has profoundly impacted our vision of speci- 
ation in prokaryotic organisms. Several results suggest that prokaryotic species are 
definable and diagnosable as genetically cohesive units, as evidenced by the existence of a 
core-genome. However, the evolution of the core-genome remains to be fully under- 
stood. It is becoming possible to analyze the evolution of species- and genus-specific 
core-genomes over relatively short evolutionary time scales by comparing related 
species when sufficient genomic data is available (Touchon et al. 2014). On the other 
hand, the vast diversity of microbial pangenomes emphasizes the versatility of 
bacterial species. Much larger data sets are needed to accurately understand the 
dynamics of bacterial pangenomes, but several species now have thousands of 
sequenced genomes available. Deciphering the evolution of the pangenome will be 
highly insightful for our understanding of the dynamics and the genomic cohesion of 
microbial species. 

In contrast to the original view of bacteria as purely clonal organisms, more and more 
evidence indicates that gene flow and HGT are key players in the evolution of most 
bacteria, and potentially act as major contributors to bacterial speciation. Computational 
approaches are needed to finely characterize gene flow in order to understand 
how networks of DNA routes can drive genomic cohesion and division in microbial 
species. Integrating these different aspects of bacterial biology will contribute to a 
more comprehensive prokaryotic species concept. 


Abstract The stunning ability of bacteria to evolve and adapt has contributed to the 
success of these single cells, which have inhabited the Earth for billions of years and 
play vital roles in the environment and in human health. The goal of this chapter is to 
present and discuss the population-level organizational scheme of bacterial 
pangenomes, wherein genes are distributed among the strains of a species, such 
that each individual strain encodes only a subset of the genes available at the 
population level. Genes from the accessory/distributed genome (those present only 
in a subset of strains within a species) impart diverse functions or variations on a 
conserved function to strains. Moreover, horizontal gene transfer generates novel 
gene combinations. The maintenance and spread of any given gene arrangement are 
influenced by fitness. Further, the extent of genomic plasticity is regulated by 
restriction modification systems, phage-defense systems, and Clustered Regularly 
Interspaced Short Palindromic Repeats (CRISPR)-associated proteins (CRISPR-Cas). 
The combination of a pangenome structure and genomic plasticity reveals a 
successful strategy for bacterial adaptation to ever-changing environments. From a 
clinical perspective, pangenome analyses inform the selection of therapeutic targets, 
designed to focus either on an entire species or on virulence features within a species. 
Further, they provide a framework for modeling the efficacy of drugs and vaccines. 
In summary, following the explosion in sequencing technology, pangenome studies 
have revealed remarkable genomic organizations at the level of species, with 
important implications for our understanding of evolution and our ability to design 
therapeutics and predict their long-term outcomes. 


Keywords Pangenome - Genomic diversity - Genomic plasticity - Horizontal gene 
transfer 


1 Introduction 


Bacteria dominate our planet and can be traced back billions of years in the 
geological record. They play critical roles in shaping our habitat, from adding 
oxygen to the atmosphere to fixing nitrogen in the soil. They also play a vital role 
in human health, with commensal/mutualistic bacteria influencing nutrition and 
immunity, and pathogenic bacteria causing diseases from epidemics like the Black 
Death of medieval times to modern-day chronic biofilm infections resulting in the 
spread of antibiotic resistance. A defining characteristic of bacteria in both the 
environment and health is their ability to rapidly evolve and adapt. Here we discuss 
the elegant population-level organizational scheme that bacterial species use wherein 
their genomes are distributed among large numbers of strains, with no single strain 
having more than a small minority of genes available at the population level. This 
distributed pan(supra)-genome provides for adaptation to countless novel challenges 
and environmental niches. 

Individual bacterial genomes have a discrete number of genes. However, enor- 
mous differences in gene content exist even among the genomes of strains of a single 
species. Therefore, the gene content of a single strain is less than the full complement 
of different genes from all strains. The comprehensive set of genes within a species, 
i.e., all genes from all strains, is defined as the pangenome (or supragenome). The 
pangenome is organized into the core genome, which corresponds to the set of genes 
conserved across all strains in the species, and the accessory genome (or distributed 
genome), which comprises all noncore genes. We compiled pangenome papers from 
PubMed, identifying 295 species-specific pangenome projects performed on approx- 
imately 70 genera (Fig. 1). In all of these projects, the pangenome was found to be 
substantially larger than the core genome (Fig. 2). 

The diversity within a species’ pangenome provides a reservoir of genetic 
material available to bacterial cells to respond to selective pressures. Horizontal 
gene transfer (HGT) is the process by which individual bacterial cells can take up 
genetic material from their environment or neighboring bacteria and generate novel, 
strain-specific gene combinations. It seems logical that when HGT occurs among 
strains of the same species these events are more likely to be adaptive or work in 
concert within the biological network, when compared to random mutations or genes 
acquired from distantly related species. This has been demonstrated to be the case in 
multiple species, where the majority of accessory genes appear to be evolving in 
tandem with the core genome (Gladitz et al. 2005). In this manner, the pangenome 
allows a species to incorporate more solutions to environmental stresses and niches 
than can be encoded by a single strain (Ehrlich et al. 2005, 2010). 


2 Steps in the Assembly of a Pangenome 


Pangenome analyses are performed on a set of strains from the same species, or very 
closely related species (often different species grouped together by genus, though we 
will not be examining those projects here). The set of all coding sequences (CDS) is 
clustered by sequence similarity with the objective of generating groups of 
orthologous genes. This is a multistep process that begins with whole-genome 
sequencing (WGS) of multiple independent bacterial (nonclonal, nonderivative) 
strains selected to represent the broadest geographic and phenotypic ranges of the 
species of interest. Following sequencing, the remaining steps are computational and 
include (1) assembly of genomes into contigs, (2) annotation of protein-coding 
sequences (CDS), and (3) clustering of CDSs based on the sequence similarity of 
nucleic acids or amino acids of their cognate encoded proteins. Once clusters are 
defined, they are classified based on strain prevalence into core or accessory (dis- 
tributed) clusters. The accessory/distributed set of gene clusters is often further 
organized into those that are widely distributed (near core/soft core) in a population 
and those that are rare (shell) or unique (Fig. 3). 

Fig. 3 Histogram of the number of gene clusters present in a given number of genomes. Taken 
from a project examining 12 genomes of Moraxella catarrhalis (Davie et al. 2011), with a total of 
2383 gene clusters 

Fig. 4 Frequency of reference to programs over the past 5 years in pangenome publications 
(referenced at least 4 times) 
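
The prevalence-based classification of gene clusters described in this section (core, soft core, shell, unique) can be sketched as follows. The 90% cutoff for "soft core" is an assumption chosen for illustration; projects use different thresholds, and the function names and toy counts are hypothetical.

```python
from collections import Counter


def classify_cluster(n_present, n_genomes, soft_core_fraction=0.9):
    """Classify one gene cluster by the number of genomes that carry it."""
    if not 1 <= n_present <= n_genomes:
        raise ValueError("n_present must be between 1 and n_genomes")
    if n_present == n_genomes:
        return "core"
    if n_present == 1:
        return "unique"
    if n_present / n_genomes >= soft_core_fraction:
        return "soft core"   # widely distributed but missing from a few strains
    return "shell"


def summarize(presence_counts, n_genomes, soft_core_fraction=0.9):
    """Count clusters per category; presence_counts maps cluster ID -> genome count."""
    return Counter(
        classify_cluster(count, n_genomes, soft_core_fraction)
        for count in presence_counts.values()
    )


toy_counts = {"clusterA": 12, "clusterB": 11, "clusterC": 5, "clusterD": 1}
print(summarize(toy_counts, n_genomes=12))
```

Everything other than "core" in this scheme belongs to the accessory (distributed) genome; the finer categories only subdivide it by prevalence.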


The tools and the parameters used to characterize gene clusters vary widely 
among projects (Fig. 4). Generally, the first project(s) within a species tend to 
focus on the basic characterization of the pangenome. Subsequent projects often 
emphasize specific areas of interest, such as the distribution of virulence factors, 
levels of horizontal gene transfer, or epigenetic factors. Our survey of 
295 pangenome projects did not reveal a strong preference for any individual 
assembly program. This is likely because assembly programs and versions perform 
differently depending on the examined species and the employed DNA sequencing 
technology. Further, many pangenome projects utilize pre-assembled genomes from 
publicly available databases (GenBank, EMBL, DDBJ, JGI, PubMLST, etc.). This 
survey found that the CD-HIT program was the most frequently used gene clustering 
software, though a diverse set of other programs were also utilized for this purpose. 
Finally, commonly used software for other analyses includes tools for gene annotation (RAST, 
Prokka, PHAST, and Prodigal) (Aziz et al. 2008; Seemann 2014; Zhou et al. 2011; 
Hyatt et al. 2010), genome/gene alignments (Muscle, Mauve, Mega, and ClustalW) 
(Edgar 2004; Darling et al. 2004; Kumar et al. 1994; Higgins and Sharp 1988), and 
phylogenetic tree building (Mega, RAxML, and PhyML) (Kumar et al. 1994; 
Stamatakis 2006; Guindon et al. 2010). Overall, there is high variability in the 
methods/software used for pangenome analyses, reflecting diversity in the scope 
and goals of these projects. 


3 Size of the Pangenome 


The size of a species’ pangenome, relative to the size of the core genome, is highly 
variable across the eubacteria. In Fig. 2, we display the variability we encountered in 
295 species-specific pangenome projects (Figs. 1 and 2). Papers included in this 
summary span from 2005 [when the first pangenomes were described in 
S. agalactiae (Tettelin et al. 2005) and H. influenzae (Shen et al. 2005; Hogg et al. 
2007)] through 2018. In all cases, the pangenome was significantly larger than the 
set of genes in a given strain. The size of the core genomes ranged from <20 to 
>60% of the pangenome (Fig. 2). 

In some cases, calculations on the size of the pangenome may reflect inaccuracies 
in the current taxonomy, instead of the underlying biology. An instance of high 
genomic diversity is observed with Gardnerella vaginalis, where only 27% 
(746/2792) of its gene clusters are core (Ahmed et al. 2012). It is likely that 
G. vaginalis appears so genomically diverse because traditional biochemical tests 
used to identify strains within this taxon were unable to distinguish among the 
multiple genomically diverse species that are actually present. Thus, in this case, 
the apparent large size of the pangenome (and the corresponding small size of the 
core genome) arose from the unintentional merging of multiple species into a single 
species. In contrast, instances of low genomic diversity are observed in the genus 
Bacillus. Both Bacillus anthracis and Bacillus thuringiensis closely resemble 
B. cereus (Vilas-Bôas et al. 2007). B. thuringiensis appears to correspond to multiple 
phylogenetic clades (lineages) within B. cereus. B. anthracis (a species with one of 
the smallest pangenomes) likely represents a single phylogenetic lineage within the 
broader, more diverse definition of B. cereus that acquired a clinically important set 
of toxin genes (Okinaka and Keim 2016; Hall et al. 2010). 

It is tempting to speculate that there are general principles that directly associate 
the size of the pangenome with the biology of the species. Factors that may play a 
substantial role are the extent of gene transfer, the degree of interactions with 
competing and cooperating species, the number of niches inhabited, or the lifestyle 
of the bacterium. The hypothesis that highly specialized environments lead to 
smaller genome sizes has been explored in the context of obligate intracellular 
species and pathogens (Merhej et al. 2009; Georgiades et al. 2011). A study of 
overall differences between the genomes of 12 highly pathogenic species compared 
to their most closely related nonpathogenic cousins found that, for the sets of 
bacteria studied, the most virulent species generally had smaller genomes, which 
suggests gene loss as well as loss-of-function mutations (Georgiades and Raoult 
2011). The reduced genome size is hypothesized to be a consequence of extreme 
specialization of the pathogens to their hosts, while the less-specialized 
nonpathogens show greater levels of genomic variation due to selective pressure to 
remain competitive in more diverse environments (Georgiades and Raoult 2011). 
While this is an interesting idea, not all studies point to a relationship between 
pathogenicity and genome size (Bonar et al. 2018). 

In a related vein, longitudinal comparative genomic studies of pathogenic clonal 
lineages of Pseudomonas aeruginosa, Burkholderia sp., and Haemophilus 
influenzae have captured microevolution and host adaptation in the human lung 
(Rau et al. 2012; Lee et al. 2017; Pettigrew et al. 2018; Moleres et al. 2018; Bianconi 
et al. 2018; Burns et al. 2001; Li et al. 2005; Jorth et al. 2015; Silva et al. 2016). In 
many cases, these changes reveal gene deletions when compared to their anteced- 
ents. For instance, serial isolates of H. influenzae clonal lineages in COPD patients 
display a significant association with loss-of-function mutations in the ompP1 
(fadL) accessory gene. fadL is beneficial to this bacterium in early infection, as it 
promotes adhesion and intracellular invasion via interactions with the epithelial cell 
ligand hCEACAM1 (human carcinoembryonic antigen-related cell adhesion molecule 1). 
In contrast, it may hinder long-term survival in the lung, as its expression 
increases sensitivity to arachidonic acid, an exogenous mammalian long-chain fatty 
acid with bactericidal effects (Moleres et al. 2018). This is indicative of selective 
pressure in favor of ompP1 function in the nasopharynx and against its function in 
the lungs. These observations support the general concept that gene loss may 
accompany the ability to survive within highly circumscribed niches (Rau et al. 
2012; Lee et al. 2017; Pettigrew et al. 2018; Moleres et al. 2018). Nonetheless, one 
must keep in mind that evolution in niches that do not support transmission may not 
be relevant to the evolution of the pangenome. Large-scale comparative pangenome 
and evolutionary studies promise to reveal the rules that shape the overall 
pangenome size, as well as identify disease and tissue-specific genes (and gene 
losses). 


4 The Accessory Genome and Functional Diversity 


In general, core genomes are enriched for housekeeping functions. These include 
energy production, amino acid metabolism, nucleotide metabolism, lipid transport, 
and translational machinery. Accessory genomes often encode genes involved in 
protein trafficking and defense, as well as many niche-specific functions. Further, 
plasmids, phage, and transposons are also often associated with accessory genomes. 
This section focuses on functional diversity as it pertains to the accessory genome. 

Phenotypic traits can result from a blend of core genes with highly variable 
accessory genes. This is exemplified by the production of the capsule (Swartley 
et al. 1997; Bentley et al. 2006), synthesis of the extracellular polymeric substance 
(EPS) (Harris et al. 2017), and modification of the cell wall (Gerlach et al. 2018). 
Here, conserved modules encoded in the core and softcore genomes are modified by 
components encoded by the accessory genome, providing a procedure to generate 
phenotypic variability. In Neisseria meningitidis, capsule biosynthesis genes are 
encoded within a single syntenic cps chromosomal region, which encodes both 
core and accessory genes. Variations in the accessory genes yield diversity in capsular 
types (Harrison et al. 2013). In Lactobacillus salivarius, the EPS cluster 2 contributes 
to the biofilm matrix. The genes at the extremities of this multigene cluster genes are 
core, while there is extensive variation in the genes encoded in the center of the 
cluster. These differences in glycotransferases and EPS biosynthesis-related proteins 
contribute to variations in the EPS structure (Harris et al. 2017). Yet another example 
is observed in methicillin-resistant Staphylococcus aureus (MRSA), where strains 
evade host immunity by modification of wall teichoic acid (WTA) using an alterna- 
tive WTA glycosyltransferase encoded on a prophage (Gerlach et al. 2018). These 
studies exemplify how diversity within the accessory genome can provide bacteria 
with a blueprint to generate variability. This genomic flexibility is likely to increase 
the adaptive potential of bacterial species in the face of environmental stresses.

Genes encoded by the accessory genome can influence pathogenic potential. A
well-studied example is Escherichia coli; this species encodes a highly diverse 
pangenome, where variability within the accessory genome leads to strains that 
differ in their ability to colonize human cell types and to trigger pathogenicity 
(Rasko et al. 2008). E. coli strains are grouped into pathovars based on the presence 
of virulence markers, often encoded on mobile elements (Kaper et al. 2004). Whole- 
genome comparative analyses of pathovars demonstrate that strains of the same 
pathovar are not always phylogenetically clustered (Rasko et al. 2008; Salipante 
et al. 2015; Hazen et al. 2013). This pattern of clustering is consistent with the 
transfer of accessory genes among E. coli strains, as well as the independent 
acquisition of virulence traits by strains in the same pathovar. One prominent 
example of HGT among E. coli strains of different pathovars is observed in the 
highly pathogenic strain that caused the 2011 German food poisoning outbreak 
(Mahan et al. 2013). Multiple genomic studies ultimately concluded that the out- 
break was caused by a Shiga toxin-producing E. coli (STEC) of serotype O104:H4, 
which harbored multiple genes commonly associated with enteroaggregative E. coli 
(EAEC), including plasmid-encoded type I aggregative adherence fimbriae that
mediate colonization and biofilm formation, an assortment of serine protease
autotransporters (SPATEs), and the chromosomally encoded Shigella enterotoxin 1 (Askar et al. 2011;
Mellmann et al. 2011; Rasko et al. 2011). Moreover, the prevalence of genetic 
transfer among E. coli strains is highlighted by the lack of an exclusive genomic 
signature among commensal E. coli strains. The strains that asymptomatically 
colonize the human gastrointestinal tract are genetically diverse (Rasko et al. 
2008). These commensal strains may serve as genetic repositories for virulence 
determinants and, in addition, gene transfer events may modify their pathogenic 
potential and drug sensitivity. In conclusion, the accessory genome of E. coli is a 
critical determinant of tissue tropism, pathogenic potential, and clinical presentation.

Non-orthologous accessory genes with related functions are often syntenic across
strains. We propose that this genomic configuration allows one variant to be
replaced by another through recombination, where the neighboring genes
provide an anchor for homologous recombination. One example is the genomic 
region that encodes the DpnI, DpnII, or the DpnIII type II restriction enzymes in 
S. pneumoniae. These loci differ in the sequence of the enzymes, the number of 
genes in the locus, and their ability to restrict phages or transforming DNA (Johnston 
et al. 2013a; Eutsey et al. 2015). Another example is the genomic region that encodes 
bacteriocins downstream of the bip histidine kinase signal transduction system in 
S. pneumoniae. While the genes in this region are predicted to be bacteriocins, the 
number of genes, their sequence, and the cells they target differ across strains (Lux 
et al. 2007; Dawid et al. 2007; Valente et al. 2016; Rezaei Javan et al. 2018). Other 
examples of this proposed mechanism, wherein conserved flanking genes anchor 
multiple variants of pathogenicity genes, include the paralogous vHiSLR genes of
H. influenzae (Kress-Bennett et al. 2016) and the bro gene variants of Moraxella 
catarrhalis (Earl et al. 2016). Syntenic regions that encode non-homologous genes 
within a single functional class may provide a pangenomic "switch," allowing cells 
to flip between variants of a single function to optimize fitness in diverse niches. 
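
The proposed "switch" configuration, in which conserved flanking genes bracket different cargo in different strains, can be searched for once the gene order of each strain is known. The sketch below is one simple way to do so and is not the method used in the cited studies. It assumes each chromosome is given as a linear list of gene cluster identifiers on one strand; all identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Flank = Tuple[str, str]
Cargo = Tuple[str, ...]


def find_switch_loci(gene_order: Dict[str, List[str]],
                     core: Set[str]) -> Dict[Flank, Dict[str, Cargo]]:
    """Return loci where the same pair of core genes flanks different cargo.

    gene_order maps a strain to its gene clusters in chromosomal order.
    A locus is reported only if the flank pair occurs in every strain and
    the genes between the flanks differ in at least one strain.
    """
    per_flank: Dict[Flank, Dict[str, Cargo]] = {}
    for strain, genes in gene_order.items():
        core_positions = [i for i, gene in enumerate(genes) if gene in core]
        # Walk over consecutive core genes and record what lies between them.
        for left, right in zip(core_positions, core_positions[1:]):
            cargo = tuple(genes[left + 1:right])
            per_flank.setdefault((genes[left], genes[right]), {})[strain] = cargo

    return {
        flank: cargo_by_strain
        for flank, cargo_by_strain in per_flank.items()
        if len(cargo_by_strain) == len(gene_order)
        and len(set(cargo_by_strain.values())) > 1
    }


# Hypothetical toy chromosomes: one variable locus between core_1 and core_2.
toy_order = {
    "strain_1": ["core_1", "rm_variant_a", "core_2", "core_3"],
    "strain_2": ["core_1", "rm_variant_b1", "rm_variant_b2", "core_2", "core_3"],
    "strain_3": ["core_1", "rm_variant_c", "core_2", "core_3"],
}
switch_loci = find_switch_loci(toy_order, {"core_1", "core_2", "core_3"})
```

Note that this filter also reports loci where the cargo is simply present in some strains and absent in others; distinguishing a true functional switch from presence/absence variation would require functional annotation of the cargo genes.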

In summary, many of the genes in the accessory genome provide new functions or 
variations on a conserved function in a manner that expands the ability of strains to 
survive or adapt in their environments. In this manner, the strain diversity resulting 
from variations in the accessory genome may serve as a population-level tool to 
ensure the survival of a bacterial species. 


5 Pangenome Plasticity 


Speaking teleologically, via intra- and inter-species gene transfer, individual bacte- 
rial strains can draw from an expanded set of genes for their own adaptation and 
evolutionary success. This phenomenon was observed as early as 1928 in
Griffith's experiment, where a nonencapsulated strain of S. pneumoniae integrated
DNA from an encapsulated isolate, leading to its conversion from avirulent to 
virulent (Griffith 1928). Almost a century later, the bacterial research community 
has described multitudinous instances of gene transfer among bacterial strains. 


5.1 Gene Transfer Events Within and Across Species 


Gene transfer events can occur anywhere, and our literature review identified
19 manuscripts that describe bacterial in vivo gene transfer within human patients
(Table 1). A common theme is the acquisition of antibiotic resistance, particularly in
regard to carbapenems, β-lactamases, and quinolones. Resistance was commonly the
result of genes acquired via bacteriophages, plasmids, or pathogenicity islands
(Conlan et al. 2014; Bielaszewska et al. 2007; Datta et al. 2017; Feld et al. 2008;
Langhanki et al. 2018; Mena et al. 2006; Neuwirth et al. 2001; Soto et al. 2011). In
our set, five cases show HGT between different bacterial species: Serratia
marcescens and Escherichia coli (Mata et al. 2010), two instances of Klebsiella
pneumoniae and E. coli (Gona et al. 2014; Göttig et al. 2015), Staphylococcus
aureus and Staphylococcus epidermidis (Hurdle et al. 2005), and Enterobacter
cloacae and E. coli (Sidjabat et al. 2014). These studies highlight how bacteria
occupying the same niche can evolve during the infectious disease process, posing
new challenges for treatment.

Table 1 Summary of studies on in vivo recombination

Bacterial species | Citation | Mechanism of transfer | Consequences and disease state
Acinetobacter baumannii | Agodi et al. (2006) | Class 1 integrons | ICU-acquired pneumonia, multiresistant antibiotype
Enterobacteriaceae | Hammerum et al. (2016) | Plasmid | Meropenem resistance
Enterobacteriaceae | Datta et al. (2017) | Plasmid transfer of blaNDM-1 | Septicemia
Enterobacter cloacae/Escherichia coli | Sidjabat et al. (2014) | Transfer of blaIMP-4 | Meropenem resistance
Enterobacter aerogenes | Neuwirth et al. (2001) | Transfer of plasmid encoding ESBL TEM-24 | Multidrug resistance
Escherichia coli | Soto et al. (2011) | Pathogenicity island acquisition | Male UTI recurrence
Escherichia coli | Schjørring et al. (2008), Bielaszewska et al. (2007) | Bacteriophage | Diarrhea and hemolytic uremic syndrome, gastroenteritis
Escherichia coli | Gumpert et al. (2017) | Conjugative antibiotic resistance plasmid | Antibiotic resistance
Haemophilus influenzae | Moleres et al. (2018) | Selective loss-of-function pressure | Loss of function: resistance to bactericidal fatty acids; acute COPD exacerbations
Klebsiella pneumoniae | Mena et al. (2006) | Insertion sequence (IS26) | Extended-spectrum beta-lactamase-producing species, carbapenem resistance
Klebsiella pneumoniae/Escherichia coli | Göttig et al. (2015) | Transconjugation of plasmid/transposon | Carbapenem resistance
Klebsiella pneumoniae/Escherichia coli | Gona et al. (2014) | Mobile genetic elements carrying blaKPC, conjugative plasmids | Carbapenem-resistant; patients developed bloodstream infections
Legionella pneumophila | McAdam et al. (2014) | Genomic island carrying T4SS | Legionnaires' disease/community-acquired pneumonia; T4SS associated with more severe symptoms
Neisseria meningitidis | Brynildsrud et al. (2018) | Genomic islands, bacteriophage (MDAphi) | NmC meningitis
Serratia marcescens/Escherichia coli | Mata et al. (2010) | Plasmid mediated | AmpC beta-lactamase, quinolone resistance
Staphylococcus aureus/epidermidis | Hurdle et al. (2005) | Conjugative replicon | Mupirocin resistance; persistent carrier of MRSA
Staphylococcus aureus | Moore and Lindsay (2001) | Multiple mobile elements, specifically phages | Hospital MSSA
Staphylococcus aureus | Stanczak-Mrozek et al. (2015) | Bacteriophages and plasmids (general transduction) | Antibiotic-resistant MRSA
Staphylococcus aureus | Langhanki et al. (2018) | Mobile elements (genomic island, pathogenicity islands, bacteriophages), transduction | Long-term persistence, cystic fibrosis patients

Cross-species transfer events introduce new genes into the species, thus 
expanding the pangenome. A prominent example is acquisition of the type 3 secre- 
tion system (T3SS) by multiple Gram-negative bacteria. The T3SS allows for the 
transport of effector proteins from the bacterial cytosol directly into the host cells 
(Hacker et al. 1997; Hueck 1998). In most cases, the genes encoding this injection 
system, and their effectors, have been acquired by HGT (Brown and Finlay 2011). 
These T3SS systems are critical components of virulence. For instance, in Salmo- 
nella, acquisition of the SPI1 T3SS enables the bacterium to invade host cells, while 
acquisition of the SPI2 T3SS enables it to escape host defenses and survive within 
host cells inside a protective vacuole (Jennings et al. 2017; Ochman et al. 1996). 
Another example of cross-species transfer has been observed in S. pneumoniae, 
where a multigene locus was acquired from Streptococcus suis (Antic et al. 2017). 
This locus was acquired exclusively by a phylogenetically distinct subset of strains 
within the S. pneumoniae species—a subset much more likely to infect the conjunc- 
tiva. The genes acquired from S. suis appear to contribute to the tissue tropism by 
promoting adherence to the ocular epithelium. Thus, expansion of the pangenome by 
gene acquisition from outside the species can contribute to bacterial virulence and 
tropism. 

Gene transfer among strains of the same species provides a mechanism to 
redistribute accessory/distributed genes within single strains. Studies on vaccine- 
escape strains of S. pneumoniae identified multiple genes acquired from a single 
donor (Golubchik et al. 2012). These recombination events ranged from 0.04 to
44 kb in size, and were located in various regions of the genome, including the
capsular locus. Separate analyses of whole genomes of S. pneumoniae have captured 
multiple instances of serotype switches including from 23F to 3 and from 19F to 19A 
(Chewapreecha et al. 2014; Croucher et al. 2014a; Hiller et al. 2011). A current 
vaccine targets the 19F capsule, but not the 19A. Serotype 19F strains were widely 
prevalent pre-vaccine, while serotype 19A strains have spread in the USA during the 
post-vaccine era (Geno et al. 2015). This serotype switch has been observed in 
vaccinated and non-vaccinated populations. These observations are consistent with a 
model where HGT generates diverse genotypes, selective pressure from vaccines 
drives the spread of a subset of strains, and competition across strains shapes the
population and distribution of accessory genes. 

Studies that describe recombination among strains driven by natural competence 
and transformation suggest that multiple transfers may occur both simultaneously 
and sequentially between individual donors and recipient strains. A study on 
S. pneumoniae captured the progressive accumulation of recombinations in a set 
of six clinical strains isolated from a pediatric patient over a 7-month period. One 
strain incurred multiple recombination events from the same donor, over two 
instances of recombination. These events introduced recombinations at 23 sites, 
and led to the exchange of over 7% of the genome (Hiller et al. 2010). Similarly, a 
laboratory study in H. influenzae also captured multiple gene transfer events after a 
bout of recombination (Mell et al. 2011). For this study, DNA from a clinical strain 
was used to transform a laboratory strain. Transformants were observed to have 
multiple recombination events over the length of the chromosome, collectively 
corresponding to ~1-3% of the genome. These analyses not only demonstrate 
HGT events across strains, but also suggest that strains may display multiple trans- 
fers during a single competence event. 
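
Figures such as "over 7% of the genome" follow from summing the lengths of the recombined segments and dividing by the genome length. The sketch below illustrates that arithmetic under stated assumptions: segments are half-open coordinate intervals on a linear genome, overlapping segments are merged so that no base is counted twice, and the coordinates and genome length are invented for illustration rather than taken from Hiller et al. (2010) or Mell et al. (2011).

```python
from typing import List, Tuple


def recombined_fraction(segments: List[Tuple[int, int]],
                        genome_length: int) -> float:
    """Fraction of the genome covered by recombined segments.

    segments are (start, end) pairs with start inclusive and end exclusive.
    Overlapping or touching segments are merged before lengths are summed.
    """
    merged: List[List[int]] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Extend the previous merged segment instead of double counting.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    covered = sum(end - start for start, end in merged)
    return covered / genome_length


# Hypothetical example: three segments, two of which overlap,
# on an assumed 2 Mb genome.
toy_segments = [(0, 40_000), (30_000, 70_000), (500_000, 570_000)]
fraction = recombined_fraction(toy_segments, 2_000_000)
```

In this toy case the merged segments span 140 kb of a 2 Mb genome, that is, 7% of its length.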

HGT occurring through natural competence and transformation is unique among
HGT mechanisms in that it is driven by the recipient rather than by the donor (as is
the case with mating and transduction). Competence is thus a phenotype expressed
by the recipient cell. As a mechanism of mutation and evolution,
it is expressed when a cell is stressed and provides a genetic means to adapt to a
stressful environment, resulting in mutation-on-demand (Ehrlich et al. 2005).


5.2 Constraints on Gene Transfer 


While there is clear evidence of HGT among strains of the same species, distributed 
genes are not randomly distributed within a species. Instead, they tend to be 
associated with specific lineages, suggesting that pangenome evolution operates 
with forces that promote as well as limit gene transfer (Croucher et al. 2014b), as 
discussed in the next paragraphs. 

There is increasing evidence that co-selection of genes limits gene transfer. A 
genome-wide study in S. pneumoniae demonstrated that a set of 876 loci, annotated 
to function in metabolism or transport, displayed a nonrandom distribution (Watkins
et al. 2015). The authors show that groups of coevolved genes (alleles) are adapted to
particular metabolic niches. They predict that disruption of these groups of alleles, a 
process mediated by HGT, would lead to a drop in strain fitness. A computational 
approach applied to S. pneumoniae and N. meningitidis also uncovered co-selection 
of genes associated with drug resistance and virulence (Pensar et al. 2019). Genome 
architecture may also limit gene transfer. Many bacterial genomes encode short 
sequences that are enriched in close proximity to the replication terminus. The 
location of these sequences is under selection, such that HGT events that disrupt 
these elements impose a fitness cost (Hendrickson et al. 2018). Thus, allele 
co-selection and genomic architecture illustrate genome-wide features that, when 
disturbed, can result in loss of fitness and consequently restrict gene flow. 
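
A nonrandom distribution of two genes across strains can be quantified, in the simplest case, by the phi coefficient of their presence/absence vectors. The sketch below is a deliberately naive illustration and not the approach of the cited studies: it treats strains as independent observations, whereas real analyses must correct for shared ancestry, since closely related strains inflate apparent associations. The vectors are invented.

```python
from math import sqrt
from typing import Sequence


def phi_coefficient(x: Sequence[int], y: Sequence[int]) -> float:
    """Phi coefficient between two binary presence/absence vectors.

    Returns +1 when the genes always co-occur, -1 when they are mutually
    exclusive, and 0.0 when either gene shows no variation across strains.
    """
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denominator == 0:
        return 0.0
    return (n11 * n00 - n10 * n01) / denominator


# Hypothetical presence (1) / absence (0) of three genes in six strains.
gene_a = [1, 1, 1, 0, 0, 0]
gene_b = [1, 1, 1, 0, 0, 0]   # always found together with gene_a
gene_c = [1, 0, 1, 0, 1, 0]   # only loosely associated with gene_a

phi_ab = phi_coefficient(gene_a, gene_b)
phi_ac = phi_coefficient(gene_a, gene_c)
```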

In addition to factors that limit gene transfer via their influence on fitness, bacteria
encode genes that serve as barriers to incoming DNA, such as restriction modification
(RM) systems, phage-defense systems, and Clustered Regularly Interspaced
Short Palindromic Repeats (CRISPR) with their CRISPR-associated proteins (CRISPR-Cas). Most
RM and CRISPR-Cas systems exert their influence on double-stranded DNA.
While DNA entering the cell by transformation is single stranded, these systems 
still appear to serve as barriers to transformation; a compelling model proposed that 
they do so via their activity on the transformed chromosome (Johnston et al. 2013b). 
Studies in N. meningitidis and S. pneumoniae illustrate the role of restriction 
modification (RM) systems in limiting HGT. Strains of N. meningitidis organize 
into distinct phylogenetic groups that are associated with the distribution of >20 RM 
systems (Budroni et al. 2011). This distribution is consistent with the hypothesis that 
the RM systems limit HGT among clades. Similarly, the PMEN1 pandemic lineage
of S. pneumoniae displays asymmetric gene transfer. Heterologous gene transfer
from PMEN1 to other strains is abundant, yet transfer into PMEN1 is modest (Wyres et al.
2012). The DpnIII RM system contributes to this structure, as it appears to limit
HGT into PMEN1 strains, and is almost exclusively found in the PMEN1 lineage
(Eutsey et al. 2015). Type I RM systems can also limit gene transfer; however, their
architecture may allow rapid evolution of HGT barriers. The type I RM systems have
a multifunctional component, where modification in one sequence can lead to both 
changes in methylation and endonuclease activity. This is in contrast to type II RM 
systems, where the protein that directs methylation is distinct from the protein that 
directs endonuclease activity, such that changes in specificity require mutations in 
more than one protein (Wilson and Murray 2002). In this manner, type I RM systems 
can rapidly evolve new specificities and generate diversity. A recent study in 
S. pneumoniae demonstrated that phase variation in the SpnIV phase-variable 
Type I RM limits acquisition of genomic islands by transformation (Kwun et al. 
2018). The work captures an instance of phase variation on a type I RM system that 
generated an HGT barrier between nearly identical strains. Together, these studies 
suggest that RM systems may foster genomic stability within subsets of strains. 

Many bacteria encode an abortive infection (Abi) system, which appears to be an
altruistic mechanism to protect the population at large. When bacteria possessing an
Abi system are infected by phage, the system is activated and triggers the death of
the bacterial host. In this manner, death of the infected cell prevents spread of the
phage across the bacterial community (Chopin et al. 2005). In an exciting twist,
phage defense systems may also be encoded by prophage, illustrating cooperation 
between bacteria and phage to restrict unrelated phages (Dedrick et al. 2017; Bondy- 
Denomy et al. 2016). 

CRISPR-Cas confers adaptive immunity in prokaryotes and has the ability to 
inhibit conjugation, transduction and transformation. CRISPR-Cas loci are composed
of arrays of palindromic nucleotide repeats interspersed with short
unique DNA segments called spacers, together with cas genes. The spacers are acquired
from foreign DNA, usually bacteriophages. Following acquisition, spacers are 
transcribed and processed into small CRISPR RNA (crRNA) molecules. A complex 
formed by Cas proteins and crRNA leads to the degradation of invading foreign 
nucleic acid, protecting cells from future invasion (Jiang and Doudna 2017; Adli 
2018). Many bacterial species and lineages are devoid of CRISPR-Cas systems. In 
vitro studies in multiple bacteria reveal an inverse correlation between HGT and the 
presence of a functional CRISPR-Cas system (Jiang et al. 2013; Watson et al. 2018). 
In Enterococcus faecalis, multidrug-resistant plasmids were observed in strains that 
lacked CRISPR-Cas systems, while the drug-sensitive strains encoded this system 
(Palmer and Gilmore 2010). Further, under selective pressure for the acquisition of 
antibiotic-resistant plasmids, Staphylococcus epidermidis strains acquired 
inactivating mutations in the CRISPR-Cas system (Jiang et al. 2013). These studies 
suggest that bacteria encounter a tradeoff: the fitness advantages associated with 
phage resistance afforded by CRISPR-Cas must be balanced against a decrease in 
genomic plasticity and the benefits conferred by acquisition of novel genes. None- 
theless, the role of phage protection systems in restricting gene flow is far from fully 
resolved. Some studies find contrasting results, and do not support the conclusion 
that CRISPR-Cas limits HGT. A large-scale computational study revealed that the 
activity of the CRISPR-Cas system was not associated with HGT events over long 
evolutionary timescales (Gophna et al. 2015). Further, a study in Pectobacterium 
atrosepticum suggests that CRISPR-Cas systems may actually contribute to HGT 
via their role in protecting bacteria against phage attack (Watson et al. 2018). Thus, 
more research is required to determine the ultimate influence of CRISPR-Cas 
systems on the genomic plasticity of bacterial populations. 
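
The spacer-based recognition step described above can be illustrated by checking whether incoming DNA contains a sequence matching any spacer in an array. This is a heavily simplified sketch: it requires an exact match on either strand and ignores protospacer-adjacent motifs, tolerated mismatches, and Cas protein specificity, all of which matter in real systems. The sequences are invented and far shorter than natural spacers.

```python
from typing import List

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def matching_spacers(spacers: List[str], incoming_dna: str) -> List[str]:
    """List the spacers with an exact match on either strand of incoming DNA."""
    dna = incoming_dna.upper()
    hits = []
    for spacer in spacers:
        query = spacer.upper()
        if query in dna or reverse_complement(query) in dna:
            hits.append(spacer)
    return hits


# Hypothetical array of two short spacers and a hypothetical plasmid fragment.
toy_spacers = ["ATGCCGTA", "GGGTTTAA"]
toy_plasmid = "TTTACGGCATCC"   # carries the reverse complement of the first spacer
hits = matching_spacers(toy_spacers, toy_plasmid)
```

Under this simplified view, an element carrying a sequence matched by a spacer would be a candidate for interference, whereas an element with no match would enter unopposed by CRISPR-Cas.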

In conclusion, the set of genes in a species’ pangenome can expand via the 
introduction of genes from other species, rearrange across strains via an intra- 
species exchange, or vary with mutations. The shuffling of accessory genes and 
alleles generates new combinations that are subsequently subjected to the forces of 
selection on gene products and genome-wide features. Moreover, RMs, CRISPR- 
Cas, and phage-defense systems may also influence gene flow across strains and 
species. All factors combined, genomic plasticity emerges as a successful strategy 
for bacterial survival.


6 A Balance in the Accessory Genome


A remarkable observation comes from recent mathematical models and population 
studies. Negative frequency-dependent selection may stabilize the proportion of 
individual accessory genes in a population of S. pneumoniae (Azarian et al. 2018; 
Corander et al. 2017). As expected, the authors observed that vaccination led to a 
dramatic drop in the representation of vaccine-sensitive strains. As a result, the
distribution of accessory genes within the population differed from that of the 
pre-vaccine population. Interestingly, over time, the frequency of the accessory 
genes trended toward that seen in the pre-vaccine population. These results suggest 
that the distribution of genes in the pneumococcal pangenome may have an equi- 
librium point. It remains to be determined whether similar patterns are observed in 
other species. The suggestion that the composition of pangenomes tends toward an 
equilibrium has important implications regarding our ability to predict the nature of 
replacement strains after the introduction of therapies that target subsets of strains 
within a bacterial population using a microbiome-sparing approach. 
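
The return of gene frequencies toward their pre-vaccine values is the behavior expected under negative frequency-dependent selection, where a gene is favored when rare and disfavored when common. The following deterministic toy model illustrates that dynamic for a single gene. It is not the model of Corander et al. (2017) or Azarian et al. (2018); the fitness function, the equilibrium frequency, and the selection strength are assumptions chosen for illustration.

```python
from typing import List


def nfds_trajectory(start_frequency: float,
                    equilibrium: float,
                    strength: float,
                    generations: int) -> List[float]:
    """Frequency of one accessory gene under a toy NFDS model.

    Carriers have fitness 1 + strength * (equilibrium - frequency), so they
    are favored below the equilibrium frequency and disfavored above it.
    Non-carriers have fitness 1.
    """
    frequency = start_frequency
    trajectory = [frequency]
    for _ in range(generations):
        carrier_fitness = 1.0 + strength * (equilibrium - frequency)
        mean_fitness = frequency * carrier_fitness + (1.0 - frequency)
        frequency = frequency * carrier_fitness / mean_fitness
        trajectory.append(frequency)
    return trajectory


# Hypothetical scenario: a perturbation drops a gene from its assumed
# equilibrium frequency of 0.4 to 0.1, after which selection acts.
recovery = nfds_trajectory(start_frequency=0.1, equilibrium=0.4,
                           strength=0.5, generations=200)
```

In this sketch the frequency climbs back toward 0.4 regardless of whether the perturbation pushed it above or below that value, which mirrors the qualitative pattern reported for the post-vaccine pneumococcal population.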


7 Clinical Applications 


Pangenomic analyses can be utilized to identify potential therapeutic targets. Target 
specificity can be customized depending on the desired effect. The core genome can 
be used to target an entire species, as it contains genes possessed by every member of 
the species. Alternatively, targeting select members of the accessory genome, or the 
"microbiome-sparing" approach, will ensure that only strains containing the gene of 
interest are affected. Both strategies can be utilized to combat a wide variety of 
pathogens. 

Current efforts to combat pathogenic bacteria include targeting the bacterial 
capsule, a large polysaccharide layer that is a major virulence determinant with a 
key role in immune evasion. Strains vary in the composition of their capsules: those 
with identical capsules are placed in the same serotype, and those with highly similar 
capsules within a serogroup. For example, there are over 97 different serotypes 
known for S. pneumoniae that fall into 46 serogroups (Bentley et al. 2006; Geno 
et al. 2015; Tzeng et al. 2016), and over 12 serotypes for N. meningitidis (Harrison 
et al. 2013; Geno et al. 2015; Tzeng et al. 2016; Claus et al. 1997). New serotypes 
can arise by HGT, as in the movement of SiaD genes between N. meningitidis
strains, or through mispairing during gene replication, which is responsible for
serotypes 15B/C in S. pneumoniae (Claus et al. 1997; van Selm et al. 2003).
Capsular polysaccharide vaccines are available for S. pneumoniae, S. typhi, and 
N. meningitidis (Geno et al. 2015; Tzeng et al. 2016; Hessel et al. 1999). These 
specifically target the bacterial capsule, but young children (under the age of two)
fail to mount an antibody response to these vaccines. To address this, polysaccharide-protein
conjugate vaccines were designed; these couple the polysaccharide
antigen to a protein carrier, rendering it more immunogenic in young children
(Finn 2004; Nair 2012; Szu et al. 1989; Lin et al. 2001). Development of conjugate
vaccines faces major challenges, such as cost, host immune response, and bacterial 
structures (Nair 2012). Therefore, it would be ideal to create capsular polysaccharide 
vaccines with better immunogenicity. However, the structures of some capsule 
sugars are too similar to those found in mammalian tissues to be useful as polysac- 
charide vaccines. In these cases, vaccines could be designed to target virulence via 
accessory genes or to target these species as a whole via the core genome (Pichichero 
2017; Daniels et al. 2016; Chan et al. 2018). 

Using the accessory genome to create strain-specific drugs and vaccines has wide 
implications. For example, it is easy to imagine the creation of therapies against 
bacterial pathogens that are able to spare the larger microbiome. Commensal bacteria 
in the microbiome and pathogenic bacteria of the same species may share the same 
core genome, but can have vast differences in the content of their accessory 
genomes. If a therapy targets protein products from genes found only in the 
accessory genomes of pathogenic bacteria, it will not disturb the patient’s microflora 
as the commensal bacteria would lack the proteins the therapy is created against. 
This strategy has the potential to greatly improve patient health and recovery 
following a bacterial infection. 
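
The selection logic behind a microbiome-sparing target can be expressed with set operations: keep the genes shared by all pathogenic strains, then discard any gene found in a commensal strain. The sketch below assumes that gene content per strain is already known and that at least one pathogenic strain is supplied; the strain and gene names are hypothetical.

```python
from typing import Dict, Set


def sparing_targets(pathogen_genomes: Dict[str, Set[str]],
                    commensal_genomes: Dict[str, Set[str]]) -> Set[str]:
    """Genes carried by every pathogenic strain and by no commensal strain."""
    shared_by_pathogens = set.intersection(*pathogen_genomes.values())
    found_in_commensals = set().union(*commensal_genomes.values())
    return shared_by_pathogens - found_in_commensals


# Hypothetical gene content for two pathogenic and one commensal strain.
toy_pathogens = {
    "pathogen_1": {"core_gene", "toxin_1", "adhesin_1"},
    "pathogen_2": {"core_gene", "toxin_1"},
}
toy_commensals = {
    "commensal_1": {"core_gene", "adhesin_1"},
}
targets = sparing_targets(toy_pathogens, toy_commensals)
```

Here the core gene is excluded because the commensal carries it, and the adhesin is excluded both because one pathogen lacks it and because the commensal carries it, leaving the toxin as the only candidate. Requiring presence in every pathogenic strain is a strict choice; a tolerance for missing genes may be preferable with draft assemblies.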

Pangenomic studies can aid in the development of diagnostic tools. As with 
vaccines and drug development, accessory genes can be used to identify a particular 
strain/phenotype and core genes to identify a specific species. A study of 17 clinical
isolates of Gardnerella vaginalis led to the proposal that G. vaginalis be reclassified as a
genus, based on the extent of pangenomic variation (Ahmed et al. 2012). Previously,
metronidazole was used as a blanket antibiotic for the treatment of bacterial vagi- 
nosis. However, the understanding that metronidazole-resistant clades of 
G. vaginalis are actually different species creates room for the development of 
diagnostic tools to inform antibiotic treatment for patients with bacterial vaginosis 
(Balashov et al. 2014). Similarly, pangenomic studies among phenotypically diver- 
gent M. catarrhalis strains led to the characterization of a deep phylogenetic 
clade structure that separated the pathogenic sero-resistant strains from commensal 
sero-sensitive strains (Earl et al. 2016). In yet another example, Staphylococcus 
epidermidis was divided into two phylogenetic groups. One group included both
commensals and pathogens; the other was composed exclusively of commensal strains.
Strains in the second group encoded formate dehydrogenase, revealing a potential
diagnostic marker (Conlan et al. 2012). A study in Helicobacter pylori identified 
lineage-specific genes; some have already been associated with acid resistance and 
virulence, and thus are potential targets to guide treatments (van Vliet 2017). 
Moreover, when studies associating pangenome and phenotype identify unannotated 
genes as diagnostic markers, they provide genetic fodder for linking new functions, 
distribution, and disease outcome (Ehrlich et al. 2010). One caution to consider in 
the development of diagnostics is that chronic infections can be caused by multiple 
strains of the same species, and analysis of a single strain could misdirect treatment. 

A crucial benefit of pangenomic analyses is their ability to determine the presence
or absence of antibiotic-resistance markers. Prescription of an ineffective antibiotic is
both detrimental to the patient’s health and a contributor to the problem of global antibiotic
resistance. Some examples of pangenomic analyses to study the distribution and 
transmission of resistance genes have been performed on E. coli strains collected 
from wastewater treatment plants (Mahfouz et al. 2018), community-associated 
Clostridium difficile strains isolated from farm animals and humans (Knetsch et al. 
2018), and strains of Stenotrophomonas maltophilia collected from cystic fibrosis 
(CF) patients (Esposito et al. 2017). Given that related strains often differ in their 
drug resistance profile, probing the accessory genome for genes that encode drug 
resistance will be a critical component of personalized medicine. 

Genome-scale models (GEMs) of metabolism can provide great insight into the 
link between metabolism and pathogenesis. These network reconstructions provide 
context for the relationship between gene, gene product, and phenotype. 
Pangenomic analyses in three species observed that the majority of core genes are 
associated with metabolism (Cornejo et al. 2013; Bosi et al. 2016; Vieira et al. 2011). 
Pangenomic analysis of inflammatory bowel disease (IBD)-associated E. coli strains 
reported metabolic differences between IBD-associated strains and nonassociated 
strains, where the former set appeared to utilize energy more efficiently (Fang et al. 
2018). The differences in metabolic capabilities in disease and healthy states provide 
a promising place to explore diagnostic applications of the pangenome. Furthermore, 
the link between metabolism and virulence can be explored, and be used diagnos- 
tically to differentiate strains that cause mild or severe symptom presentation (Bosi 
et al. 2016). 

Beyond their use in selecting targets for vaccines, therapeutics, and diagnostics,
pangenomic analyses have also served as an epidemiological tool. The origin of the
2010 cholera outbreak in Haiti was traced using pangenomic analysis of Vibrio
cholerae. Initially, it was unclear whether the epidemic originated with a local strain
or an Asian strain. A pangenomic analysis revealed that the epidemic was caused by
strains that originated in Southeast Asia (Reimer et al. 2011; Hendriksen et al. 2011;
Chin et al. 2011; Mutreja et al. 2011; Orata et al. 2014; Hasan et al. 2012). Such 
epidemiological studies allow better strategic planning to avoid future epidemics. 


8 Conclusions 


The Distributed Genome Hypothesis provides both a historical and theoretical 
framework for understanding bacterial genomic plasticity, and puts it in the context 
of other classes of chronic pathogens (viruses and eukaryotic parasites) that have 
developed different mechanistic strategies for the generation of genetic diversity in 
situ. Viruses such as HIV-1 utilize an error-prone DNA polymerase (reverse tran- 
scriptase) to generate enormous diversity resulting in the development of a 
quasispecies within days of infection (Korber et al. 2001). Trypanosomes utilize a 
cassetting mechanism for antigen switching wherein they have an entire chromo- 
some of outer surface protein cassettes that they can exchange within the larger 
functional protein whenever the host adaptive immune response recognizes the
previous cassette (Horn 2014). Thus, within this context, we can view HGT of
distributed genes among bacterial strains of a species as yet another means of 
“programmed” variation (Ehrlich et al. 2010). 


9 Perspectives 


The plasticity provided by the eubacterial pangenome may be driving the evolution 
of other domains of life. The rapid recombination of bacterial strains provided the
evolutionary pressure for the development of the vertebrate adaptive immune system,
which is mechanistically similar to what the bacteria are doing: it is essentially
a random gene rearrangement phenomenon, very similar to HGT (Hu et al.
2007). Lastly, as the variability in species becomes apparent, it triggers the question
of how best to define a species. While pangenomic analyses do not offer the ultimate 
solution, they may provide a useful definition. Once the core genome of a species is 
defined, strains can be assigned, or not assigned, to a species based on the extent to 
which they share the same core genome (Nistico et al. 2014).


A Review of Pangenome Tools and Recent Studies


G. S. Vernikos 


Abstract With the advance of sequencing technologies, the landscape of genomic 
analysis has been transformed, by moving from single strain to species (or even 
higher taxa)-wide genomic resolution, toward the direction of capturing the “totality” 
of life diversity; from this scientific advance and curiosity, the concept of 
“pangenome” was born. Herein we will review, from practical and technical imple- 
mentation, existing projects of pangenome analysis, with the aim of providing the 
reader with a snapshot of useful tools should they need to embark on such a 
pangenomic journey. 


Keywords Pangenome - Whole-genome - Exhaustive search - Subsampling - 
Regression function - Command line - Web-interface - Bayesian - Hidden Markov 
Models - Clustering - ORF alignment similarity - Combinatorial approach - Ortholog 
clusters - Reference pangenome - Finite supragenome model - Binomial mixture 
model - Infinitely many genes model - Gene presence/absence frequency 


1 Introduction 


Almost 15 years ago, Tettelin et al. (2005) conceived the concept of the pangenome, in
an attempt to describe and model the genomic totality of a taxon (species, serovar,
phylum, kingdom, etc.) of interest. Since then, the nomenclature of this concept
has grown fairly wide, accommodating terms such as pangenome, core and dispensable
genes, strain-specific genes (Medini et al. 2005; Tettelin et al. 2005), supragenome,
distributed and unique genes (Lapierre and Gogarten 2009), and flexible regions
(Rodriguez-Valera and Ussery 2012). Simply put, using the original definition, the
core-genome describes the set of sequences shared by all members of the taxon of
interest, the dispensable genome captures the subset of sequences shared by some, but
not all, members of the group (dictating the diversity of the group: alternative
biochemical pathways, niche adaptation, antibiotic resistance, etc.), while the
pangenome is simply the union of the core and dispensable genomes (describing the
totality of the taxon at the level of sequence datasets).
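
These definitions map directly onto set operations. The following is a minimal sketch, assuming each strain has already been reduced to a set of gene-family identifiers; the strain names, gene labels, and the function name partition_pangenome are hypothetical and serve only as illustration.

```python
def partition_pangenome(genomes):
    """Split gene families into core, dispensable, and pangenome sets.

    genomes: dict mapping a strain name to the set of gene-family ids it carries.
    """
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)          # every family seen in any strain
    core = set.intersection(*gene_sets)    # families present in all strains
    dispensable = pan - core               # families present in some strains only
    return core, dispensable, pan


# Hypothetical toy dataset of three strains.
toy = {
    "strainA": {"g1", "g2", "g3"},
    "strainB": {"g1", "g2", "g4"},
    "strainC": {"g1", "g3", "g5"},
}
core, dispensable, pan = partition_pangenome(toy)
print(sorted(core), sorted(dispensable), sorted(pan))
```

In practice the difficult part is upstream of this step: deciding which sequences belong to the same gene family, which is what most of the tools reviewed below address.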

The exponential growth of genomic databases started in 1995 with Haemophilus 
influenzae being the first complete genome project (Fleischmann et al. 1995). Today, 
as of August 2018, 110,660 complete whole-genome sequencing projects—of which 
87% are bacteria—and 15,066 finished whole-genome sequencing projects 
(Mukherjee et al. 2017) are available in the public domain. These fueled the interest 
of many researchers to carry out pangenome analysis at every conceivable phylo- 
genetic resolution level (Table 1), exploiting various modeling frameworks, assump- 
tions, and underlying homology search engines. 

A pivotal work in terms of phylogenetic resolution was carried out by Lapierre 
and Gogarten (2009), showing that on average in the largest bacterium group 
analyzed so far, the core gene set accounts only for 8% of the pangenome. 

The pangenome concept can be implemented either in reverse or in forward-thinking
approaches; in the first case, we are interested in capturing the genomic
diversity of the group of interest, while in the second case we are more interested
in exploring and predicting, from a pragmatic perspective, the minimum
number of genome sequences required to capture the totality of the group.
Obviously, limited or sparse datasets might lead to erroneous conclusions; therefore, it
was recommended (Vernikos et al. 2015) that the minimum number of genomes to
analyze be at least five.

The lifestyle of the species of interest is one of the parameters strongly dictating 
the distribution shape of the pangenome; for example, if by recurring addition of 
group members, the pangenome continues to grow, we are analyzing an open 
pangenome (such examples include human pathogens and environmental bacteria) 
(Hiller et al. 2007; Tettelin et al. 2008). On the other hand, if the group complexity is
exhausted very quickly, even from the analysis of a handful of group members, then we
are dealing with a closed pangenome, whereby only a few representatives are needed to
describe the totality of the sequence variability.


2 Technical Implementation 


In pangenome analysis, the sequence unit for the modeling can be anything from
ORFs, genes, clusters of orthologous groups (COGs) (Tatusov et al. 1997), coding
sequences (CDS), proteins, and arbitrary sequence chunks to concatenated gene or
protein entities.

Table 1 Examples of the application of pangenome approaches at different levels of phylogenetic resolution

Level | Organism | Approach^a | # Genomes | Core size (# genes) | Year (reference)
Species | Streptococcus agalactiae | ORFsim, Comb | 8 | 1806 | Tettelin et al. (2005)
Species | Neisseria meningitidis | ORFsim, Comb | 6 | 1337 | Schoen et al. (2008)
Species | Neisseria meningitidis | ORFsim, Comb | 20 | 1630 | Budroni et al. (2011)
Species | Borrelia burgdorferi | ORFsim, Comb | 21 | 1200 | Mongodin et al. (2013)
Species | Escherichia coli | ORFsim, Comb | 17 | 2344 | Rasko et al. (2008)
Species | Enterococcus faecium | ORFsim, Comb | 7 | 2172 | van Schaik et al. (2010)
Species | Yersinia pestis | ORFsim, Comb | 14 | 3668 | Eppinger et al. (2010)
Species | Streptococcus pyogenes | OG, Comb | 11 | 1376 | Lefebure and Stanhope (2007)
Species | Clostridium difficile | OG, Comb | 15 | 1033 | Scaria et al. (2010)
Species | Lactobacillus paracasei | OG | 34 | 1800 | Smokvina et al. (2013)
Species | Campylobacter jejuni | ORFsim, Ref | 130 | 1042 | Meric et al. (2014)
Species | Campylobacter coli | ORFsim, Ref | 62 | 947 | Meric et al. (2014)
Species | Haemophilus influenzae | FSM | 13 | 1450 | Hogg et al. (2007)
Species | Streptococcus pneumoniae | FSM | 17 | 1400 | Hiller et al. (2007)
Species | Streptococcus pneumoniae | ORFsim, Comb | 44 | 1666 | Donati et al. (2010)
Species | Staphylococcus aureus | FSM | 16 | 2245 | Boissy et al. (2011)
Species | Moraxella catarrhalis | FSM | 12 | 1755 | Davie et al. (2011)
Species | Lactobacillus casei | FSM | 17 | 1715 | Broadbent et al. (2012)
Species | Gardnerella vaginalis | FSM | 17 | 746 | Ahmed et al. (2012)
Species | Clostridium botulinum | ORFsim, Comb | 13 | 2657 | Bhardwaj and Somvanshi (2017)
Group | Bacillus cereus | ORFsim, Comb | 4 | 3000 | Lapidus et al. (2008)
Group | Bacillus subset of species | ORFsim, Comb | 12 | 2009 | Eppinger et al. (2011)
Genus | Streptococcus | OG, Comb | 26 | 600 | Lefebure and Stanhope (2007)
Genus | Streptococcus | ORFsim, Comb | 52 | 522 | Donati et al. (2010)
Genus | Prochlorococcus | ORFsim, Comb | 12 | 1273 | Kettler et al. (2007)
Genus | Bifidobacterium | ORFsim, Comb | 14 | 967 | Bottacini et al. (2010)
Genus | Listeria | BMM | 13 | 2032 | den Bakker et al. (2010)
Genus | Salmonella | BMM | 35 | 2811 | Jacobsen et al. (2011)
Genus | Shewanella | OG | 24 | 1878 | Zhong et al. (2018)
Genus | Finegoldia | OG | 12 | 1202 | Brüggemann et al. (2018)
Class | Bacilli | IMGM | 172 | 143 | Collins and Higgs (2012)
Phylum | Chlamydiae | OG | 19 | 560 | Collingro et al. (2011)
Super kingdom | Eubacteria | Gene freq. | 573 | 250 | Lapierre and Gogarten (2009)

^a ORFsim ORF alignment similarity, Comb combinatorial approach of adding successive genomes,
OG ortholog clusters, Ref initial generation of a reference pangenome using a subset of strains, FSM
finite supragenome model, BMM binomial mixture model, IMGM infinitely many genes model,
Gene freq gene presence/absence frequency

Practical considerations that directly influence the validity of the conclusions
drawn include how quickly a pangenome is expected to grow and reach a plateau
(open or closed pangenome), the search-engine parameters that determine which
sequences are orthologous and thereby directly affect the pool of core and
dispensable sequence entities, and the mathematical model and applied distribution
used to forecast the evolution of the pangenome and core-genome size.
Another limiting factor, as the number of genomes becomes higher and higher, is the
scalability of all possible genome addition permutations, since the total number of
comparisons needed is described by the following function:


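$$C = \frac{N!}{(n-1)!\,(N-n)!}$$


where C is the total number of comparisons for the addition of the nth genome, N is
the total number of genomes, and n is the number of genomes in the comparison.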
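
To get a feeling for how quickly this number grows, the sketch below evaluates the count for each addition step and sums it over all steps. It assumes the per-step form N!/((n-1)!(N-n)!), as used by Tettelin et al. (2005) for eight genomes; the function names are illustrative only.

```python
from math import factorial


def comparisons_at_step(total_genomes, n):
    """Number of ways to add an nth genome to n - 1 of the remaining genomes."""
    return factorial(total_genomes) // (
        factorial(n - 1) * factorial(total_genomes - n)
    )


def total_comparisons(total_genomes):
    """Sum of the per-step counts over all addition steps n = 2..N."""
    return sum(
        comparisons_at_step(total_genomes, n)
        for n in range(2, total_genomes + 1)
    )


for size in (8, 20, 50):
    print(size, total_comparisons(size))
```

For eight genomes the sum is 1016, which agrees with the number of points reported for the exhaustive analysis in Fig. 1. For 20 genomes it already exceeds ten million.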

A workaround to the exhaustive approach is to subsample (Vernikos et al. 2015) the
total number of comparisons needed: comparisons are randomly selected, making sure
that each genome undergoes the same number of comparisons. The trick here is to set
the number of comparisons to a value that optimally balances the available
computational power against the target dataset size. Indeed, observations from
datasets of limited size showed that even extreme subsampling can still model the
pangenome reliably, bypassing the need for an exhaustive all-against-all comparison
(Fig. 1) (Vernikos et al. 2015). Additional optimizations can be achieved by
exploiting regression functions that are alternatives to the original exponential
decay; practical implementations of such optimizations are described in Tettelin
et al. (2008), Eppinger et al. (2010, 2011), Mongodin et al. (2013) and Riley et al.
(2012).

Fig. 1 Pangenome analysis plots for Streptococcus agalactiae genomes (n = 8).
(a) Number of new genes detected for adding a genome g to g - 1 genomes. Red
bubbles: 1016 points for the total number of comparisons (no subsampling). Blue
bubbles: 600 points (subsampling, multiplicity of 15). Green bubbles: 248 points
(subsampling, multiplicity of 5). (b) Regression curve on averages (the subsampling
method has limited impact on the outcome). Axes: number of genomes (x) versus new
genes (y)
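
The subsampling idea can be sketched as follows. This is one possible scheme, not necessarily the procedure of Vernikos et al. (2015): for every genome and every addition step, at most a fixed number (the multiplicity) of random sets of previously added genomes is drawn, so that every genome undergoes the same number of comparisons. The function name and strain labels are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def subsample_comparisons(genomes, multiplicity, seed=0):
    """Pick, for each genome and each step n, at most `multiplicity` random
    sets of n - 1 other genomes against which the genome is compared."""
    rng = random.Random(seed)
    picks = []
    for n in range(2, len(genomes) + 1):
        for new in genomes:
            others = [g for g in genomes if g != new]
            # Enumerating every candidate set is only feasible for small N.
            candidates = list(combinations(others, n - 1))
            k = min(multiplicity, len(candidates))
            picks.extend((new, prior) for prior in rng.sample(candidates, k))
    return picks


strains = list("ABCDEFGH")  # eight hypothetical strain labels
sample = subsample_comparisons(strains, multiplicity=5)
per_genome = Counter(new for new, _ in sample)
print(len(sample), set(per_genome.values()))
```

With eight genomes this scheme yields 248 comparisons at a multiplicity of 5 and 600 at a multiplicity of 15, the same point counts as listed in the legend of Fig. 1. A production implementation would draw the random sets directly instead of enumerating all candidates first.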

Recently several stand-alone or server-based suites have become available for 
pangenome analysis; in the next paragraphs, we will review the most promising and 
interesting initiatives. See also Table 2 for additional details.


3 Bayesian Decision Model


van Tonder et al. (2014) designed a methodology based on a Bayesian decision model
that can directly analyze next-generation sequencing (NGS) data. The model defines
the core-genome of bacterial populations and also allows the identification of novel
genes. A nice feature of this approach is that it can analyze even strains lacking a
subset of genes, since the model does not assume that all sequences have the entire
core gene dataset present. The model has been benchmarked on Streptococcus
pneumoniae sequences.


4 BGDMdocker 


BGDMdocker (Cheng et al. 2017) relies on Docker technology to analyze and
visualize bacterial pangenomes and biosynthetic gene clusters. The pipeline consists
of three stand-alone tools, namely Prokka v1.11 (Seemann 2014) for rapid
prokaryotic genome annotation, panX (Ding et al. 2018) for pangenome analysis, and
antiSMASH 3.0 (Weber et al. 2015) for automatic genomic identification and
analysis of biosynthetic gene clusters. The visualization supports several options,
including alignment, phylogenetic trees, mutations mapped on the phylogenetic branches,
and gene loss and gain mapping on the core-genome phylogeny. Benchmarking took
place on 44 Bacillus amyloliquefaciens strains.


5 Bacterial PanGenome Analysis 


Bacterial Pangenome Analysis (BPGA) (Chaudhari et al. 2016) comes with a
handful of new options and features, most notably optimized speed of
execution. In addition, it offers phylogeny for various entities (core-genome,
pangenome, and MLST), phyletic profile analysis (gene presence/absence), subset
analysis, atypical sequence composition analysis, orthologous and functional
annotation for all gene datasets, user selection of the gene clustering algorithm,
a command line interface, and nice graphics. It runs on both Windows and Linux as
executable files (source code in Perl). BPGA depends on other tools that require
installation. In terms of input files, BPGA can “digest” the following file formats:
GenBank (.gbk) files, protein sequence files (e.g., .faa, .fsa, or FASTA format), and
binary (0,1) matrix (tab-delimited) files produced by other tools. The seven
functional modules of BPGA are: pangenome profile analysis, pangenome sequence
extraction, exclusive gene family analysis, atypical GC content analysis, pangenome
functional analysis, species phylogenetic analysis, and subset analysis.


6 ClustAGE


ClustAGE (Ozer 2018), available both online and stand-alone, clusters noncore
accessory sequences within a collection of bacterial isolates using the BLAST
algorithm. It is therefore focused on the accessory genomic dimension of the
pangenome. Benchmarking of this tool has taken place on Pseudomonas aeruginosa
genome sequences.


7 DeNoGAP 


DeNoGAP (Thakur and Guttman 2016) does much more than pure pangenome
analysis, including functional annotation, gene prediction, protein classification,
and orthology search; therefore, it is applicable to both complete and draft genomic
data. To do this, it implements a large set of existing analysis algorithms. In terms of
scalability, it scales linearly owing to the use of iteratively refined hidden
Markov models. Its modular structure supports easy updates and addition of new
tools.


8 EDGAR 


Implementing phylogenetic concepts like average amino acid and nucleotide identity 
indices, an online application namely “EDGAR” (Blom et al. 2009, 2016) was 
developed to support comparative genomic analyses of related isolates. Strong 
utilities of the suite include Venn diagrams and interactive synteny plots, as well 
as ease of access to taxa of interest and quick analyses like pangenome vs. core plot, 
the core-genome and singletons. 


9 EUPAN 


EUPAN (Hu et al. 2017) is one of the first concrete attempts to analyze eukaryotic
pangenomes, even at a relatively low sequencing depth. It supports genome assembly,
gene annotation of the pangenomic dataset, and identification of core and accessory
gene datasets by exploiting read coverage. The tool has been benchmarked using
453 rice genomes.


10 GET_HOMOLOGUES


GET_HOMOLOGUES (Contreras-Moreira and Vinuesa 2013) is a customizable
and detailed pangenome analysis platform (open source, written in Perl and R) for
microorganisms, aimed at non-bioinformaticians. GET_HOMOLOGUES can
cluster homologous gene families using bidirectional best-hit clustering algorithms.
The cluster granularity can be adjusted by the user based on various filtering
strategies (e.g., by controlling key BLAST parameters such as percentage overlap and
identity of pairwise alignments and the E-value cutoff). To estimate the size of the
core- and pangenome, the tool supports both exponential and binomial mixture
models to fit the data.
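
The bidirectional best-hit criterion itself is simple to state in code. The sketch below is not taken from GET_HOMOLOGUES; it assumes that pairwise alignment scores between the proteins of two genomes are already available as dictionaries, and all identifiers and scores are invented for illustration.

```python
def best_hits(scores):
    """Return the highest-scoring subject for every query.

    scores: dict mapping (query, subject) to an alignment bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical bit scores between genome A and genome B proteins.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 60.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a2"): 175.0,
          ("b2", "a3"): 70.0, ("b3", "a3"): 40.0}
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here a3 finds b2 as its best hit, but b2 prefers a2, so only the pairs (a1, b1) and (a2, b2) are retained. Filters on overlap, identity, and E-value would be applied to the hits before this step.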


11 Harvest 


Harvest (Treangen et al. 2014) is suitable for the analysis of up to thousands of
microbial genomes. It hosts three modules, namely Parsnp (core-genome analysis),
Gingr (output visualization), and HarvestTools (meta-analysis). Parsnp jointly
exploits whole-genome alignment and read mapping to optimize the accuracy and
scalability of sequence alignment. For indexing purposes, it implements a directed
acyclic graph, improving the identification of unique matches (anchors). The input of
Parsnp is a directory of MultiFASTA files; the output includes a core-genome
alignment, variant calls, and a SNP tree, all of which can be visualized via Gingr.
Broadly speaking, this tool represents a compromise between whole-genome alignment
and read mapping. Parsnp performance has been evaluated on simulated and real data.


12 ITEP 


ITEP (Benedict et al. 2014) is a suite of BASH scripts and Python libraries that 
interface with an SQLite database backend and a large number of tools for the 
comparison of microbial genomes. ITEP hosts several de novo prediction tools such 
as sequence alignment, metabolic, clustering, and protein prediction. Users can 
develop their own customized comparative analysis workflows. 


13 LS-BSR 


LS-BSR (large-scale BLAST score ratio) (Sahl et al. 2014) calculates a score ratio
(BSR value = query/reference bit score) per coding sequence (as a matrix) within a
pangenome dataset, using BLAST (Altschul et al. 1997) or BLAT (Kent 2002) for
all-against-all alignment purposes. The output (score ratio per CDS) can be visualized
as a heatmap. Benchmarking has taken place on Escherichia coli and Shigella
datasets.
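
The score ratio is straightforward to compute once the bit scores are known. The following sketch is not LS-BSR code; it assumes the reference score is the bit score of each coding sequence aligned against itself, treats a missing hit as a score of zero, and uses invented identifiers and values.

```python
def bsr_matrix(self_bits, hit_bits, genomes):
    """Build a BLAST score ratio matrix.

    self_bits: dict mapping a CDS id to its self-alignment (reference) bit score.
    hit_bits:  dict mapping (CDS id, genome) to the best bit score in that genome.
    genomes:   ordered list of genome names (matrix columns).
    """
    matrix = {}
    for cds, reference in self_bits.items():
        if reference <= 0:
            raise ValueError(f"reference bit score must be positive for {cds}")
        matrix[cds] = [
            round(hit_bits.get((cds, genome), 0.0) / reference, 3)
            for genome in genomes
        ]
    return matrix


# Hypothetical bit scores.
self_bits = {"cdsA": 500.0, "cdsB": 200.0}
hit_bits = {("cdsA", "g1"): 500.0, ("cdsA", "g2"): 250.0, ("cdsB", "g1"): 180.0}
matrix = bsr_matrix(self_bits, hit_bits, ["g1", "g2"])
print(matrix)
```

A ratio close to 1 indicates a near-identical copy of the coding sequence in that genome, and a ratio close to 0 indicates its absence; this is what makes the matrix suitable for display as a heatmap.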


14 micropan 


micropan (Snipen and Liland 2015) is an R package for the pangenome study of 
prokaryotes. The R computing environment supports several options of statistical 
analyses (e.g., principal component analysis), pangenome models (e.g., Heaps’ law), 
and graphics. External free software (e.g., HMMER3) is used for the heavy compu- 
tations involved. Benchmarking has been carried out on 342 Enterococcus faecalis 
genomes. 
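
As an illustration of the Heaps' law model mentioned above, the sketch below fits the power law n(N) = kappa * N^(-alpha) to the mean number of new genes found when the Nth genome is added, following the formulation of Tettelin et al. (2008), in which alpha <= 1 indicates an open pangenome and alpha > 1 a closed one. It uses a plain least-squares fit on a log-log scale, which is a simplification and not the fitting procedure of micropan; the input counts are synthetic.

```python
from math import exp, log


def fit_heaps(new_gene_counts):
    """Fit n(N) = kappa * N**(-alpha) by least squares on a log-log scale.

    new_gene_counts[i] is the mean number of new genes observed when
    genome number i + 2 is added; all counts must be positive.
    """
    xs = [log(n) for n in range(2, len(new_gene_counts) + 2)]
    ys = [log(count) for count in new_gene_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    alpha = -slope
    kappa = exp(mean_y - slope * mean_x)
    return kappa, alpha


# Synthetic counts generated from kappa = 100 and alpha = 0.5.
counts = [100.0 * n ** -0.5 for n in range(2, 10)]
kappa, alpha = fit_heaps(counts)
status = "open" if alpha <= 1 else "closed"
print(round(kappa, 3), round(alpha, 3), status)
```

With real data the counts would be averages over many genome orderings, and the fit is sensitive to how those orderings are sampled.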


15 NGSPanPipe 


NGSPanPipe (Kulsum et al. 2018) supports microbial pangenome analysis directly 
from experimental reads. Benchmarking has been carried out using simulated reads 
of Mycobacterium tuberculosis. The pipeline expects as input experimental reads 
and outputs three files, one of which is a binary matrix showing the presence/absence 
of genes in each strain; this matrix can be used as input to other pangenome tools like 
PanOCT (Fouts et al. 2012) and PGAP (Zhao et al. 2012). 
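
A binary presence/absence matrix of this kind is easy to post-process. The sketch below assumes a tab-delimited layout with a header row of strain names and one row per gene; this layout, the gene names, and the function names are assumptions made for illustration and may differ from the actual NGSPanPipe output.

```python
import csv
import io


def read_presence_absence(handle):
    """Parse a tab-delimited 0/1 matrix: header of strains, one row per gene."""
    reader = csv.reader(handle, delimiter="\t")
    strains = next(reader)[1:]
    matrix = {row[0]: [int(value) for value in row[1:]] for row in reader if row}
    return strains, matrix


def classify_genes(matrix, n_strains):
    """Label each gene as core, strain-specific, or dispensable."""
    labels = {}
    for gene, row in matrix.items():
        present = sum(row)
        if present == n_strains:
            labels[gene] = "core"
        elif present == 1:
            labels[gene] = "strain-specific"
        else:
            labels[gene] = "dispensable"
    return labels


# Hypothetical matrix for three strains.
text = "gene\tS1\tS2\tS3\ngeneA\t1\t1\t1\ngeneB\t1\t0\t1\ngeneC\t0\t1\t0\n"
strains, matrix = read_presence_absence(io.StringIO(text))
classes = classify_genes(matrix, len(strains))
print(strains, classes)
```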


16 PanACEA 


PanACEA (Clarke et al. 2018) is an open source stand-alone computer program
written in Perl that lets users create an interconnected set of HTML, JavaScript,
and JSON files visualizing prokaryotic pan-chromosomes (core and variable regions)
generated by PanOCT (Fouts et al. 2012) or other pangenome clustering tools.
PanACEA was developed to serve as an intuitive, easy-to-use, stand-alone viewer.
Regions and genes can be functionally annotated to allow for visual identification of
regions of interest. PanACEA's memory and time requirements are within the
capacities of standard laptops. Benchmarking took place on 219 Enterobacter
hormaechei genomes.


17 Panaconda


Panaconda (Warren et al. 2017) creates whole-genome multiple sequence compar- 
isons and provides a model for representing the relationship among sequences as a 
graph of syntenic gene families, by discovering collision points within a group of 
genomes. The first step is to create a de Bruijn graph and use its traversal to build a 
pan-synteny graph; the alphabet used is based on gene families (instead of nucleotide 
alphabet). This approach is novel in the context of generating a graph, wherein all 
sequences are fully represented as paths. 
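
To make the idea of a de Bruijn graph over a gene-family alphabet concrete, the sketch below builds such a graph from ordered lists of gene-family identifiers. It is a toy illustration, not Panaconda's implementation: the genomes, the family labels, and the choice of k are hypothetical, and the branching nodes it reports are only a simple proxy for places where the gene order of the genomes diverges.

```python
from collections import defaultdict


def gene_family_de_bruijn(genomes, k=2):
    """Build a de Bruijn graph whose symbols are gene families.

    genomes: dict mapping a genome name to its ordered list of gene-family ids.
    Nodes are k-mers of consecutive families; edges join overlapping k-mers.
    """
    edges = defaultdict(set)
    for families in genomes.values():
        kmers = [tuple(families[i:i + k]) for i in range(len(families) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return edges


def branch_points(edges):
    """Nodes with more than one successor, i.e. where gene order diverges."""
    return sorted(node for node, targets in edges.items() if len(targets) > 1)


# Two hypothetical genomes that differ in a single gene family.
genomes = {
    "genome1": ["A", "B", "C", "D", "E"],
    "genome2": ["A", "B", "C", "X", "E"],
}
graph = gene_family_de_bruijn(genomes, k=2)
print(branch_points(graph))
```

Every genome remains a path through the graph, and shared gene order collapses into shared nodes, which is the property the paragraph above describes.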


18 PanCake 


PanCake (Ernst and Rahmann 2013) is another tool for pangenome analysis (core
and unique regions) relying exclusively on sequence data and pairwise alignments
(nucmer or BLAST), which makes it annotation independent (i.e., it processes pure
whole-genome content). It hosts a command line interface with several
subcommands that allow the user to add chromosomes, to specify a genome for each
chromosome, to add alignments, to compute core and unique regions, and to output
selected regions of the analyzed chromosomes. Benchmarking took place on three
genera, namely Pseudomonas, Yersinia, and Burkholderia. PanCake is written in
Python.


19 PanFunPro 


PanFunPro (Lukjancenko et al. 2013) exploits functional information (profiles) for
pangenome analysis. The suite supports, among others, calculation of core and
accessory gene datasets, homology search (all-against-all and pairwise
sub-querying), functional annotation (HMM-based), and gene-ontology information
analysis. PanFunPro is available both as a stand-alone (Perl) tool and as a web server.
Benchmarking took place on 21 Lactobacillus genomes. 


20 PanGeT 


PanGeT (Yuvaraj et al. 2017) can digest both genomic and proteomic data in order to 
construct the pangenome for a selection of taxa, exploiting BLASTN or BLASTP, 
respectively. In terms of performance, it has been benchmarked using a set of 
11 Streptococcus pyogenes strains. The output is given in the form of a flower plot 
(core, dispensable, and strain-specific genes).


21 PanGFR-HM


PanGFR-HM (Chaudhari et al. 2018) brings an interesting viewpoint to pangenome
analysis by focusing exclusively on microbes from the Human Microbiome Project.
It is a web-based platform integrating functional and genomic
analysis for a collection of ~1300 complete human-associated microbial genomes,
and it adds a novel dimension of analysis, namely body site (the location of the
microbe in the human body), when comparing different groups of organisms.


22 PanGP 


PanGP (Zhao et al. 2014) supports scalable pangenome analysis by analyzing 
clusters of orthologs pre-computed by OrthoMCL (Li et al. 2003), PGAP (Zhao 
et al. 2012), Mugsy-Annotator (Angiuoli et al. 2011), or PanOCT (Fouts et al. 2012). 
In order to predict core and accessory gene datasets, the suite implements random or 
distance-guided sampling; in the latter, the genomic diversity (GD) drives the 
sampling of strain permutations. GD is modeled relying on three alternative assump- 
tions: GD is determined by the evolutionary distance on phylogenetic trees, the 
difference in gene numbers per strain, or by the discrepancy among gene clusters; 
among the three models the third seems more reliable (preferred model for PanGP). 


23 PANINI 


PANINI (Abudahab et al. 2018) is a web browser implementation for rapid online
visualization and analysis of the core and accessory genome content. It implements
unsupervised machine learning based on the t-SNE (t-distributed stochastic
neighbour embedding) algorithm, which first calculates the pairwise similarities
between the data points in the high-dimensional space and then minimizes, over the
embedding coordinates, the divergence between these similarities and the
corresponding similarities in the low-dimensional embedding. PANINI expects as
input the output of Roary (Page et al. 2015).


24 PANNOTATOR 


PANNOTATOR (Santos et al. 2013) supports automatic annotation
transfer onto related unannotated genomes by exploiting the existing annotation of a
curated genome. From this perspective, it is not primarily a pangenome analysis tool;
rather, it provides pangenome-related information as a side-product of
cross-comparison. Its main contribution to pangenome analysis is to accelerate the
functional annotation of closely related isolates. For this task, it implements a
relational database, interactive tools, several SQL reports, and a web-based interface.
The expected input is the DNA strand, the gene prediction, and the reference
annotated genome.


25 PanOCT 


PanOCT (Fouts et al. 2012) is a graph-based ortholog clustering tool for pangenome
analysis of closely related prokaryotic genomes. It uses BLAST (Altschul et al. 1997)
together with conserved gene neighborhood information to separate recently diverged
paralogs into distinct clusters of orthologs where homology-only clustering methods
cannot. Four input files are expected, including a tabular file of all-versus-all
BLASTP searches and the actual protein FASTA sequences. In terms of memory
requirements, PanOCT is greedier than the other tools used to benchmark its
performance; the memory usage is unchanged until the sixth genome, with a usage of
0.25 GB per genome, maxing out at 0.5 GB per genome by the 25th genome.


26 Panseq 


Panseq (Laing et al. 2010) builds pangenomes and identifies single nucleotide
polymorphisms (SNPs) using genomic data as input. In addition, it produces files
for further phylogenetic analysis exploiting both the SNP information and
the phyletic profile of accessory sequences; all of this is wrapped up in a
user-friendly graphical user interface.


27 Pan-Tetris 


Pan-Tetris (Hennig et al. 2015) is a Java-based tool that exploits an aggregation
technique inspired by the Tetris game to provide an interactive and dynamic
visualization of the gene content in a pangenome table, with the option of editing
and on-the-fly modification of user-defined (pan) gene groups. The suite has been
tested on 32 Staphylococcus aureus genomes. Pan-Tetris is one of the first tools
to enable modification of the computed pangenome. The computation of the
whole-genome alignment relies on the progressiveMauve (Darling et al. 2010) algorithm.


28 PanTools


PanTools (Sheikhizadeh et al. 2016) suite supports the construction and visualization 
of pangenomes hosting online tools and algorithms; the visual representation of the 
pangenome is based on generalized De Bruijn graphs. The pangenome construction 
algorithm scales nicely even with large eukaryotic datasets. In addition to the basic 
pangenome tasks (construction and visualization), the suite supports other handy 
utilities such as adding, retrieving and grouping of sequences as well as annotating, 
reconstructing, and comparing genomes or pangenomes. Overall, it can easily 
support multi-genome read mapping, pangenome browsing, structure-based varia- 
tion detection and comparative genomics. It has been benchmarked on E. coli, yeast, 
and Arabidopsis thaliana genomes. 


29 PanViz 


PanViz (Pedersen et al. 2017) is a pangenome visualization tool with some analysis 
options. It can generate dynamic visualizations supporting both pangenome subset 
selection as well as mapping of new genomes to existing pangenomes. The input 
data needed is a pangenome matrix (gene group presence/absence across the 
included genomes), as well as a gene ontology-based functional annotation of each 
gene group. 


30 PanWeb 


PanWeb (Pantoja et al. 2017) is a web application that performs pangenome analyses
based on the PGAP pipeline, providing in addition a user-friendly graphical interface
supporting multiple user-defined analysis queries. It can be used by researchers
without computational skills. As input, it receives the annotation files for each
genome in EMBL format. A complete set of graphs (e.g., pangenome, accessory,
core-genome, and unique genes) is provided.


31 panX 


panX (Ding et al. 2018) identifies orthologous gene clusters in pangenomes and
presents them via a user-friendly and interactive web-based visualization. The
visualization consists of connected components that allow further analysis. The suite
provides alignments and phylogenetic trees, maps mutations of each gene cluster, and
infers gene gain and loss on the core-genome phylogeny. The pipeline breaks annotated
genomes into genes and then clusters them into orthologous groups. To identify
homologous proteins, panX performs an all-against-all similarity search, while the
actual clustering of orthologous genes is carried out by a Markov clustering algorithm.
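
For readers unfamiliar with Markov clustering, the toy version below shows its two alternating steps, expansion (matrix squaring) and inflation (element-wise powering followed by column normalization), on a small similarity matrix. It is a didactic sketch in pure Python, not the code used by panX; the inflation value, the thresholds, and the example matrix are arbitrary choices.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1."""
    size = len(matrix)
    sums = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    return [[matrix[i][j] / sums[j] for j in range(size)] for i in range(size)]


def markov_cluster(similarity, inflation=2.0, iterations=50, tol=1e-9):
    """Toy Markov clustering on a symmetric similarity matrix (list of lists).

    Returns the clusters as a sorted list of lists of item indices.
    """
    size = len(similarity)
    # Self-loops keep every column non-empty and stabilize the iteration.
    m = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(size)]
         for i in range(size)]
    )
    for _ in range(iterations):
        # Expansion: flow spreads along paths of length two.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(size))
                     for j in range(size)] for i in range(size)]
        # Inflation: strong flows are boosted, weak flows fade.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(size) for j in range(size))
        m = inflated
        if change < tol:
            break
    # Rows with a positive diagonal are attractors; their non-zero entries
    # are the items that flow to them.
    clusters = {tuple(j for j in range(size) if m[i][j] > 1e-6)
                for i in range(size) if m[i][i] > 1e-6}
    return sorted(list(cluster) for cluster in clusters)


# Two groups of three proteins, joined by one weak similarity (items 2 and 3).
sim = [[0.0] * 6 for _ in range(6)]
for group in ([0, 1, 2], [3, 4, 5]):
    for a in group:
        for b in group:
            if a != b:
                sim[a][b] = 1.0
sim[2][3] = sim[3][2] = 0.05
print(markov_cluster(sim))
```

The inflation parameter controls granularity: higher values tend to produce more and smaller clusters.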


32 PGAdb-Builder 


PGAdb-builder (Liu et al. 2016) constructs a pangenome allele database (PGAdb) to
empower whole genome multilocus sequence typing (wgMLST) analyses and
operates as a web service suite. Two modules are implemented, namely
Build_PGAdb for building a PGAdb database and Build_wgMLSTtree for
constructing a wgMLST tree and determining the genetic relatedness of the input
sequences; both modules “digest” genome contigs in FASTA format. PGAdb-builder
does, however, depend on other existing suites such as Prokka (Seemann
2014) and Roary (Page et al. 2015).


33 PGAP 


PGAP (Zhao et al. 2012) supports pangenome analysis and in addition analysis of 
functional gene clusters, species evolution, genetic variation, and functional enrich- 
ment of query sequences. It outputs the basic pangenome structure and growth curve 
and in addition SNP and genomic variation information, phylogenetic, and func- 
tional annotation metadata. Benchmarking has taken place on Streptococcus 
pyogenes datasets. 


34 PGAP-X 


Building on PGAP, and in order to more effectively interpret and visualize the 
results, PGAP-X (Zhao et al. 2018) was developed. The visualization utility can 
intuitively lead to conclusions on pangenomic structure, conserved regions and 
overall on genetic variability throughout the pangenomic datasets at hand. 
Benchmarking has taken place on S. pneumoniae and Chlamydia trachomatis 
datasets. One current limitation of PGAP-X (that is not present in PGAP) is that it
expects as input only complete genomes.


35 Piggy


Piggy (Thorpe et al. 2018) is a tool for analyzing the intergenic component of 
bacterial genomes and it is designed to be used in conjunction with Roary (Page 
et al. 2015). The latter works by analyzing protein-coding sequences thus excluding 
nonprotein-coding intergenic regions (IGRs) which typically account for approxi- 
mately 15% of the genome. Piggy matches Roary except that it is based only on 
IGRs. Benchmarking took place on Staphylococcus aureus and Escherichia coli 
using large genome datasets. In terms of input and output, Piggy uses the same 
format as in Roary and has similar running time requirements. Piggy provides a 
means to rapidly identify IGR switches, with many evolutionary applications 
including analysis of the role of horizontal transfer in shaping the bacterial regulome. 


36 pyseer 


pyseer (Lees et al. 2018) is geared toward genome-wide association studies in the
“world” of microbes, with the aim of identifying genetic variation
linked with certain phenotypic traits. pyseer is a Python implementation
of a previous initiative written in C++, namely SEER (Lees et al. 2016). The
foundation of pyseer is the use of k-mers (words) of variable length, taken as input
from draft assemblies; for each word, a generalized linear model is used to evaluate
its link with a potential phenotype. In addition, multidimensional scaling of
a pairwise distance matrix is implemented in order to control for population structure
(embedded in the regression analysis).
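
The first step of such an analysis, recording which variable-length words occur in which sample, can be sketched as follows. This is a simplified illustration rather than pyseer code: it ignores reverse complements, keeps everything in memory, and stops before the regression step; the sample names and sequences are invented.

```python
from collections import defaultdict


def kmer_presence(assemblies, min_k, max_k):
    """Map every k-mer (min_k <= k <= max_k) to the samples that contain it.

    assemblies: dict mapping a sample name to a list of contig strings.
    """
    presence = defaultdict(set)
    for sample, contigs in assemblies.items():
        for contig in contigs:
            for k in range(min_k, max_k + 1):
                for start in range(len(contig) - k + 1):
                    presence[contig[start:start + k]].add(sample)
    return presence


def variable_kmers(presence, n_samples):
    """Keep k-mers that are present in some but not all samples."""
    return {kmer: samples for kmer, samples in presence.items()
            if 0 < len(samples) < n_samples}


# Two hypothetical draft assemblies differing by one substitution.
assemblies = {"s1": ["ACGTAC"], "s2": ["ACGTTC"]}
presence = kmer_presence(assemblies, min_k=3, max_k=4)
informative = variable_kmers(presence, len(assemblies))
print(sorted(informative))
```

Only the words retained in the last step carry information for an association test, since words shared by all samples cannot distinguish phenotypes.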


37 Roary 


Roary (Page et al. 2015) enables the construction of large pangenomes even on a
typical desktop machine while yielding accurate output. For example, it can digest
up to 1000 strains, building the pangenome in ~4 h with 13 GB of RAM. Roary's
accuracy is attributable to its use of conserved gene neighborhood information.
A suite of command line tools is provided to interrogate the dataset through
union, intersection, and complement queries.
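
The three kinds of query map directly onto set operations. The sketch below uses
plain Python sets on a hypothetical gene presence table; it does not call Roary
or reproduce its command line tools, and the isolate and gene names are invented
for illustration.

```python
def genes_in_any(group, presence):
    """Union: genes found in at least one isolate of the group."""
    return set().union(*(presence[iso] for iso in group))


def genes_in_all(group, presence):
    """Intersection: genes found in every isolate of the (non-empty) group."""
    return set.intersection(*(presence[iso] for iso in group))


def genes_only_in(group_a, group_b, presence):
    """Complement: genes seen in group_a but in no isolate of group_b."""
    return genes_in_any(group_a, presence) - genes_in_any(group_b, presence)


# Hypothetical presence table: isolate -> set of gene clusters it carries.
presence = {
    "iso1": {"dnaA", "gyrB", "mecA"},
    "iso2": {"dnaA", "gyrB", "blaZ"},
    "iso3": {"dnaA", "gyrB"},
}

pan = genes_in_any(presence, presence)        # all genes observed
core = genes_in_all(presence, presence)       # genes shared by all isolates
unique = genes_only_in(["iso1"], ["iso2", "iso3"], presence)
```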
38 seq-seq-pan


seq-seq-pan (Jandrasits et al. 2018) is a workflow for the sequential alignment of
sequences to build a pangenome data structure and a whole-genome alignment. The
data structure allows genomes to be added to or removed from a set of aligned
sequences, with subsequent realignment of the whole-genome sequences; for
whole-genome alignments it relies on progressiveMauve (Darling et al. 2010). The
alignment is optimized for generating a representative linear presentation of the
aligned set of genomes.


39 Spine and AGEnt 


Spine (Ozer et al. 2014) determines the core-genome from a group of genomic 
sequences and AGEnt (Ozer et al. 2014) identifies the accessory genome in draft 
genomic sequences. They both use nucmer to align sequences. The pipeline has been 
tested on genome sequences of Pseudomonas aeruginosa. However, as mentioned 
by the authors, whole genome alignment of reference genomes and core-genome 
identification with Spine can be time-consuming. 


40 SplitMEM 


SplitMEM (Marcus et al. 2014) scales linearly in terms of time and space in relation to 
the number of genomes of interest. To do this, it traverses suffix trees (for the genomes) 
and builds compressed de Bruijn graphs of pangenomes. In terms of notation, nodes 
within the graph represent conserved or strain-specific sequences of the pangenome. 
Benchmarking has taken place on Bacillus anthracis and E. coli datasets. 
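
To make the node notation concrete, the sketch below builds a small colored de
Bruijn graph and separates shared from strain-specific nodes. It is a deliberate
simplification and not SplitMEM's algorithm: the graph is left uncompressed
(one node per k-mer), it is built by scanning the sequences directly rather than
by traversing suffix trees, and the two toy genomes are hypothetical.

```python
from collections import defaultdict


def colored_de_bruijn(genomes, k):
    """Return (node -> genomes containing it, node -> successor nodes)."""
    colors = defaultdict(set)
    edges = defaultdict(set)
    for name, seq in genomes.items():
        previous = None
        for i in range(len(seq) - k + 1):
            node = seq[i:i + k]
            colors[node].add(name)
            if previous is not None:
                edges[previous].add(node)  # consecutive k-mers overlap by k-1
            previous = node
    return colors, edges


# Hypothetical toy genomes that differ near the 3' end.
genomes = {"A": "ACGTACGGA", "B": "ACGTACCGA"}
colors, edges = colored_de_bruijn(genomes, k=3)

shared = {node for node, names in colors.items() if len(names) == len(genomes)}
specific = {node: next(iter(names)) for node, names in colors.items() if len(names) == 1}
```

Compression would then merge each unbranched run of nodes into a single node,
so that long conserved or strain-specific stretches are each represented once.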


41 Highlights 


Pangenome analysis can today be implemented in many ways. Users can choose among
tools according to the focus of the analysis, the desired input and output, the
dependence on other algorithms, and the model parametrization. In the current
review, we highlight the following five tools: BPGA (Chaudhari et al. 2016) for its
very fast execution time, intuitive handling, and user-defined clustering
algorithm; Roary (Page et al. 2015) for its internal processing (clustering of
high-similarity sequences), which results in linear memory consumption; LS-BSR
(Sahl et al. 2014), which, similarly to Roary, performs pre-clustering that
substantially reduces the running time; PanOCT (Fouts et al. 2012), which takes
into account both homology and positional gene neighborhood information; and PGAP
(Zhao et al. 2012), which can also work with draft forms of genomic data such as
annotated assemblies.


42 Food for Thought 


The final results and conclusions of a pangenome analysis depend heavily on,
among others, the following aspects, which need thoughtful consideration before
embarking on any such project: the homology search algorithm, the phylogenetic
sample at hand, the pangenome model implemented, and the type and quality of the
sequence entities (e.g., DNA, protein, presence/absence phyletic profiles, and SNPs).

For example, when it comes to homology definition based on sequence similarity 
there is a wide range of similarity thresholds used in previous attempts: i = 50%, 
L = 50% (Tettelin et al. 2005), i = 70%, L = 70% (Hiller et al. 2007), i = 70%, 
L = 50% (Meric et al. 2014), i = 30%, L = 80% (Bentley et al. 2007), where i stands 
for sequence identity and L for sequence length. 
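
The practical effect of these choices can be illustrated by applying the four
threshold pairs to a single pairwise hit. The sketch assumes that L is the
percentage of the longer sequence covered by the alignment; the cited studies
may define it differently, and the hit itself (60% identity, 330 aligned
residues between sequences of 500 and 550 residues) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    label: str
    min_identity: float  # i, percent identity
    min_coverage: float  # L, percent of sequence length aligned (assumed meaning)


SCHEMES = [
    Thresholds("Tettelin et al. 2005", 50, 50),
    Thresholds("Hiller et al. 2007", 70, 70),
    Thresholds("Meric et al. 2014", 70, 50),
    Thresholds("Bentley et al. 2007", 30, 80),
]


def is_homolog(identity, aligned_len, len_a, len_b, thresholds):
    """Apply identity and coverage cutoffs; coverage is taken over the longer sequence."""
    coverage = 100.0 * aligned_len / max(len_a, len_b)
    return identity >= thresholds.min_identity and coverage >= thresholds.min_coverage


# One hypothetical hit evaluated under every scheme.
calls = {t.label: is_homolog(60.0, 330, 500, 550, t) for t in SCHEMES}
```

Under these assumptions the same pair is called homologous by only one of the
four schemes, which is why gene counts from different studies are not directly
comparable.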

The starting level (ORFs, CDSs, genes, proteins, SNPs) and the quality (in silico,
manual curation) of annotation, as well as inherent bacterial genomic complexity at
the sequence level, such as low-complexity repeats, recombination hot spots, and
horizontally acquired genomic fragments, constitute other important aspects to
consider. Such variability in the input information can strongly shift the balance
between predicted conserved and unique genes in favor of the former or the latter;
this might also determine the inferred structure of the pangenome (open or closed).


43 Conclusions 


The ability to digest algorithmically the largest possible pool of available data is
critical for approaching the phylogenetic history of bacterial populations more
reliably. Indeed, such comparative genomic analyses started by exploiting
~0.07% of a genome (16S rRNA) (Woese 1987), later used up to ~0.2% of
the genomic information (MLST) (Maiden et al. 1998), and recently up to 100% of
the information by exploiting the pangenome wealth of data (Medini et al. 2005;
Tettelin et al. 2005).
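
The orders of magnitude behind these percentages can be checked with a short
calculation. All three inputs are assumptions made for illustration and are not
taken from the cited studies: a genome of 2.2 Mb, a 16S rRNA gene of about
1,500 bp, and an MLST scheme of seven loci of about 450 bp each.

```python
def percent_of_genome(marker_bp, genome_bp):
    """Share of the genome covered by a marker, in percent."""
    return 100.0 * marker_bp / genome_bp


GENOME_BP = 2_200_000   # assumed genome size
RRNA_16S_BP = 1_500     # assumed length of one 16S rRNA gene
MLST_BP = 7 * 450       # assumed scheme: seven loci of ~450 bp each

frac_16s = percent_of_genome(RRNA_16S_BP, GENOME_BP)
frac_mlst = percent_of_genome(MLST_BP, GENOME_BP)
```

With these inputs the 16S rRNA gene covers roughly 0.07% of the genome and the
MLST loci somewhat under 0.2%; larger genomes give correspondingly smaller
fractions.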

The recent explosion of sequencing projects replaced the limiting factor of data
sparsity with immense data dimensionality (Vernikos 2010), and we are now in
the middle of a transformation from top-down (trying to fit the limited data
to the model) to bottom-up approaches, in an attempt to move from the “infant” stage
of single-strain genomics to the post-pangenome era of “adulthood.” Model
assumptions therefore become less and less pivotal as the pace of primary data
generation continues to grow exponentially, calling not for superior modeling power
but for skill in interpretation and in connecting the dots.
Abstract Prokaryotes demonstrate tremendous variation in gene content, even
within individual bacterial clones or lineages. This diversity is made possible by 
the ability of bacteria to horizontally transfer DNA through a variety of mechanisms, 
and the extent of such transfer sets them apart from eukaryotes. What has become 
evident through interrogation of thousands of bacterial genomes is that gene varia- 
tion is directly related to the ecology of the organism and is driven by continual 
processes of niche exploration, diversification, and adaptation. Of course, the acqui- 
sition of new genes is not necessarily beneficial, resulting in either the removal of 
that individual through purifying selection or the occurrence of compensatory 
mutations in the genomic “backbone” (i.e., core genes) that become epistatically 
linked to the presence of accessory genes. There are now numerous examples of
the relationship between gene variation and niche adaptation. We explore some of
those examples here as well as the population genomic footprint left by the dynamics 
of gene flow, diversification, and adaptation. 
1 Introduction


Pangenomes of bacterial species show a tremendous range of diversity in size, 
content, and fluidity. In comparing the core genome size in relation to the accessory 
genome, some species possess relatively limited pangenomes while others are 
expansive. Accessory genomes may be composed of genes belonging to phages, 
transposons, insertion sequences, and plasmids, as well as genes that have diverged 
through mutation and recombination to the point where they are considered as a 
separate homolog. Some of these genomic elements may be relatively stable (e.g., an 
integrated prophage), while others may be gained and lost within a single bacterial 
culture (e.g., plasmids). In this chapter, we will discuss the population genomics of
pangenomes and, more specifically, of accessory genomes, detailing how accessory
genomes vary among and within bacterial species and the implications this variation
has for microbial ecology. Throughout this discussion, it is important to not lose 
sight of what we are referring to with the catch-all phrase "accessory." These are the 
dynamic elements of the genome, often containing large genomic islands that 
augment the bacterium's phenotype, which may, as we will outline, be used to 
glean knowledge of ecology and evolutionary history of a genus, species, or set of 
lineages. Further, in no way does the term accessory or the misleading synonym 
"dispensable" suggest non-essential, as some "accessory" genes actually represent 
divergent variants of an essential gene. 


2 Mechanisms of Pangenome Variation 


The content and diversity of a bacterium's accessory genome are directly associated
with the mode and frequency of horizontal gene transfer (HGT), which in turn is 
tightly linked to ecology. Modes of HGT include transformation: the uptake and 
integration of exogenous DNA from the environment, transduction: the introduction 
of exogenous DNA into the bacterial cell through a viral vector (e.g., bacteriophage), 
and conjugation: the direct transfer of DNA between two bacterial cells through a 
pilus, which usually involves plasmids and transposons. Bacteria vary in the degree 
to which each of these mechanisms occurs within their populations and in their DNA 
uptake mechanisms. It is also almost certain that other variants of these mechanisms 
remain to be discovered, as illustrated by recent work describing "lateral transduc- 
tion" capable of transferring genomic regions of remarkable size (Chen et al. 2018). 

Integrative and conjugative elements (ICE) include integrative plasmids and 
conjugative transposons, which are circularized mobile elements transferred through 
conjugation. ICE may harbor a number of genes important to virulence, specialized 
metabolism, and survival, and are the primary means by which antibiotic resistance
genes are transmitted among bacteria. Plasmids may contain anywhere from 5 to
100 or so genes, allowing a lineage to gain or lose many loci in a single step,
especially in species with high plasmid diversity. The phylum Proteobacteria,
which includes several pathogenic species from the genera Escherichia, Salmonella,
Vibrio, Helicobacter, and Yersinia and from the order Legionellales, possesses some
of the most prevalent and diverse plasmids with a wide host range (Shintani et al.
2015). Therefore, unsurprisingly, species among these genera have moderately large
pangenome sizes (McInerney et al. 2017).

Naturally competent (transformable) species are able to take up DNA directly
from the environment, resulting in homologous or nonhomologous recombination,
the latter frequently associated with gene gain (Croucher et al. 2012). Arguably the
most famous of these species, Streptococcus pneumoniae, was made so by its role in
the Griffith experiments in 1928, which ultimately led to the identification of DNA
as the conveyor of genetic information. Through those experiments, Griffith observed
that “rough” (i.e., unencapsulated) avirulent S. pneumoniae could become virulent
through exposure to heat-killed virulent “smooth” (i.e., encapsulated) pneumococci
(Griffith 1928). We now know that what he observed was transformation resulting in
the acquisition of the capsular polysaccharide (CPS) locus, which encodes the genes
responsible for the synthesis and polymerization of the antigenic serotype capsule.
There are over 90 serotypes identified, and the CPS loci span 10,337 to 30,298 bp
with at least 26 coding sequences depending on the particular serotype (Bentley et
al. 2006). Therefore, this single recombination event resulted in the acquisition of
at least 26 accessory genes. Since then, other species including Neisseria
gonorrhoeae, Campylobacter jejuni, Vibrio cholerae, and Haemophilus influenzae have
been found to be naturally competent.

Another method by which transformation may result in differences in gene content 
is through events that lead to gene diversification, which are frequently observed 
among several species as recombination “hotspots.” The primary effect of these 
events is antigenic variation in genes linked to host-pathogen interactions. For 
example, among pneumococci, two virulence factors, pneumococcal surface proteins 
A and C (pspA and pspC), are known to have 3 and 11 variants, respectively 
(Hollingshead et al. 2000; Iannelli et al. 2002). These variants are diverse in length, 
structural organization, and nucleotide variation, the result of frequent
recombination events. Most importantly, they differ in serology, which has significant
implications for host immunity (Azarian et al. 2016; Georgieva et al. 2018). Simi- 
larly, among gonococci, the opa and neighboring pil loci are highly mosaic due to 
recombination of existing alleles (Bilek et al. 2009). The gene product Opa is an outer 
membrane adhesion protein that is important for colonization and invasion of the 
genital and nasopharyngeal mucosal epithelium. As a note, antigenic variation 
through recombination leads to an interesting contradiction in terminology. In both 
of these examples, pspA, pspC, opa, and pil are considered “core” genes in the sense 
that each member of their respective species possesses a variant. They are by all 
definitions “essential” to core cell function; yet, under current methods of
pangenome analysis, which are commonly based on a nucleotide identity threshold of
at least 80%, they are operationally identified as accessory genes. Finally, transduction
through temperate bacteriophages may introduce considerable gene variation in both 
Gram-negative and Gram-positive bacteria (Feng et al. 2008; Waldor and Friedman 
2005). While their precise evolutionary impact in most cases remains unclear, it is 
certain that their pathogenesis plays a significant role in the biology of their host. For
example, many phages harbor genes coding for virulence factors including toxins or 
secreted enzymes (Romero et al. 2009); therefore, prophages (bacteriophages inte- 
grated into host bacterial genomes) represent a significant mechanism for variation of 
virulence among closely related bacteria (Fortier and Sekulovic 2013). In relation to 
pangenome dynamics, the transmission of bacteriophages can result in significant 
variation among bacterial populations on short timescales by two mechanisms: 
through (1) the direct integration of the prophage and (2) the acquisition or evolution 
of antiphage mechanisms. The latter may involve phage-inducible chromosomal
islands and CRISPR-Cas systems (Reyes-Robles et al. 2018), which independently 
represent instances of gene acquisition and a source of pangenome variation. 
Predator-prey dynamics of bacteriophages and their hosts have been widely observed
with Siphoviridae phages and S. pneumoniae (Romero et al. 2009), lambda
STX-coding phage in Shiga toxin-producing E. coli, and ICP (Myoviridae) and
CTX phages in Vibrio cholerae (Seed et al. 2011; Waldor and Friedman 2005), 
among myriad others. The result is highly variable prophage content even within 
closely related members of bacterial lineages (Croucher et al. 2014). 


3 Population Genomics of Pangenomes 


Today, the identification of a bacterial sample's core genome is a common interme- 
diate step among bioinformatics pipelines for preparing whole-genome sequencing 
data for phylogenetic analysis. Historically, the accessory genome was largely 
ignored with the exception of the identification of important genes such as those 
conferring antibiotic resistance or increased virulence. Methodologically, it was 
difficult to scale accessory genome analysis to large population samples of a species 
and especially across several species. Then, the discovery that, in three diverse
E. coli isolates, less than 40% of the genes were found in the genomes of all three
demonstrated that extensive variation was possible (Welch et al. 2002). A subsequent study
of just eight genomes of Streptococcus agalactiae (Group B Streptococcus) 
published in 2005 identified 1806 core genes and 439 “dispensable” genes, 
highlighting that tremendous variation could be observed with even a small sample 
(Tettelin et al. 2005). This study introduced the concept of the pangenome. Now,
large-scale analyses of pangenomes continue to reveal significant diversity even over 
short timescales, providing information about the demographic history and adaptive 
evolution of bacteria. These studies have shown that pangenome size and diversity 
vary among species and depend on lifestyle (McInerney et al. 2017; Ochman and 
Davalos 2006). 

McInerney and colleagues recently summarized the range of diversity observed
among species (McInerney et al. 2017). Pangenome sizes ranged from 974 genes for
the obligate intracellular bacterium Chlamydia trachomatis to 40,362 for the
semiaquatic crop plant Oryza sativa (rice). Comparing the size of the accessory
genome in relation to the total number of genes in the pangenome, O. sativa had the
smallest, with just 8% of genes being accessory, while in Salmonella enterica, a
staggering 83% of its 10,267 genes are found in the accessory genome. Assessing
“genomic fluidity” is
another method for quantifying pangenome diversity (Kislyuk et al. 2011). Instead 
of assessing the relationship between core and accessory genome size, genomic 
fluidity measures the dissimilarity of genomes evaluated at the gene level calculated 
as the “ratio of unique gene families to the sum of gene families in pairs of genomes 
averaged over randomly chosen genome pairs from within a group of N genomes.” 
In a comparison of genomic fluidity among seven species known to undergo HGT, 
Neisseria meningitidis, Escherichia coli, and Streptococcus spp. ranked highest in 
genomic fluidity (Kislyuk et al. 2011) (although it should be noted that this metric is 
expected to be affected by the sample chosen for study). 
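
The quoted definition translates directly into a short calculation. The sketch
below averages over all genome pairs rather than over randomly chosen pairs,
which is feasible only for small N, and assumes that every genome has already
been reduced to a non-empty set of gene family identifiers; the three toy
genomes are hypothetical.

```python
from itertools import combinations


def genomic_fluidity(genomes):
    """Mean over genome pairs of (families unique to either genome) / (sum of family counts)."""
    pairs = list(combinations(genomes.values(), 2))
    if not pairs:
        raise ValueError("at least two genomes are required")
    total = 0.0
    for a, b in pairs:
        unique = len(a - b) + len(b - a)    # families found in only one genome of the pair
        total += unique / (len(a) + len(b))  # assumes non-empty genomes
    return total / len(pairs)


# Hypothetical gene family content of three genomes.
genomes = {
    "g1": {"A", "B", "C", "D"},
    "g2": {"A", "B", "C", "E"},
    "g3": {"A", "B", "F", "G"},
}
fluidity = genomic_fluidity(genomes)
```

A value of 0 means that all genomes carry identical gene families, and a value
of 1 means that no pair of genomes shares any family. Because the result is an
average over pairs, it shifts with the choice of genomes included.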

Within a species, accessory genome diversity increases with core genome diver- 
gence and models of homologous recombination and HGT have shown how these 
processes lead to the formation of population structure (Croucher et al. 2014; 
Marttinen et al. 2015). Boundaries for HGT across species roughly follow the 
same trajectories. Species in genera Streptococcus, Neisseria, and Campylobacter, 
for example, have been shown to engage in HGT more frequently with closely 
related members (e.g., between S. pneumoniae and S. mitis and S. oralis, and 
N. gonorrhoeae and N. meningitidis). Therefore, the size and distribution of acces- 
sory genes in a population provide insights into the demographic history of bacterial 
species as well as delineations of species boundaries. 

As we have described, many mechanisms can generate accessory genome diversity.
While not wholly analogous to the way nucleotide mutations arise and propagate in a 
population, the gain and loss of genes nonetheless inform the shared evolutionary 
history of a population in the same manner. Genomic islands acquired through HGT 
often become relatively fixed in bacterial lineages (Croucher et al. 2014) with the 
number of acquired genes increasing with lineage age (Donati et al. 2010). This is 
especially true for Staphylococcal Cassette Chromosome mec (SCCmec) elements in 
clones of S. aureus (International Working Group on the Classification of Staphy- 
lococcal Cassette Chromosome Elements (IWG-SCC) 2009), pathogenicity islands 
among toxigenic and non-toxigenic lineages of V. cholerae (Wozniak et al. 2009), 
and CPS loci in pneumococci (Bentley et al. 2006). These mobile elements, there- 
fore, inform long-scale evolutionary history, while in the short term, prophage 
variation and the scars of transformation events reflect more recent events. As 
such, it is possible to recapitulate the core genome phylogeny of a population 
through phylogenetic reconstruction using a presence-absence alignment of accessory
genes, represented by 1s and 0s, respectively (Azarian et al. 2018). In essence,
this represents a tight linkage between core genome single nucleotide polymor- 
phisms and the history of gene gain and loss. This may, of course, oversimplify 
the complex interconnected processes that led to accessory gene variation, but it does 
provide an easy data structure that may be investigated to understand how bacterial 
populations change over time. 
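
Such a data structure can be sketched as follows. Each isolate becomes a string
of 1s and 0s over the accessory genes, and pairwise Hamming distances give a
first impression of relatedness. This is only an illustration: the isolates and
gene names are hypothetical, and an actual phylogenetic reconstruction would
pass the alignment to dedicated tree-building software rather than stop at
distances.

```python
def presence_absence_alignment(presence):
    """Return (ordered accessory genes, isolate -> string of 1s and 0s)."""
    all_genes = set().union(*presence.values())
    core = set.intersection(*presence.values())
    accessory = sorted(all_genes - core)  # core genes carry no presence/absence signal
    rows = {
        isolate: "".join("1" if gene in genes else "0" for gene in accessory)
        for isolate, genes in presence.items()
    }
    return accessory, rows


def hamming(a, b):
    """Number of accessory genes present in one isolate but not the other."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical gene content of three isolates.
presence = {
    "iso1": {"dnaA", "gyrB", "cps19F", "phiA"},
    "iso2": {"dnaA", "gyrB", "cps19F"},
    "iso3": {"dnaA", "gyrB", "tetM"},
}
accessory, rows = presence_absence_alignment(presence)
distances = {
    ("iso1", "iso2"): hamming(rows["iso1"], rows["iso2"]),
    ("iso1", "iso3"): hamming(rows["iso1"], rows["iso3"]),
    ("iso2", "iso3"): hamming(rows["iso2"], rows["iso3"]),
}
```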

An interesting approach to assessing temporal changes in bacterial population 
genomics is to consider the dynamics of the accessory genome. The clearest 
examples of this are observations of rapid changes in virulence or antibiotic 
resistance among bacterial lineages, often leading to short-term success of a clone
(Croucher et al. 2014). Human interventions, namely vaccines, affect
not only the distribution of lineages in a population but also the available pool of
accessory genes. For example, if an ICE is strongly associated with a lineage, and
that lineage is targeted by a vaccine, then the removal of the lineage from the
population may ultimately remove the reservoir for that ICE. The impact of vaccines
on the pathogen population of S. pneumoniae has been extensively studied (Azarian 
et al. 2018; Croucher et al. 2013). After the introduction of the seven-valent 
pneumococcal conjugate vaccine (PCV7) in the USA, an analysis of a sample of 
616 genomes of pneumococci carried in children in Massachusetts showed the 
removal of accessory genes associated with the CPS loci of vaccine serotypes 
(Croucher et al. 2013). In addition, the prevalence of antibiotic resistance genes
associated with two transposons was shifted due to the removal of two vaccine 
lineages they were associated with and the subsequent emergence of a non-vaccine 
lineage harboring one of the transposons. A study of pneumococcal population 
dynamics over 13 years and spanning the introduction of the PCV7 showed that 
the introduction of vaccines greatly shifted the frequencies of accessory genes in the 
population (Azarian et al. 2018). Surprisingly, the frequencies of accessory genes 
then shifted back to pre-vaccine values as the pneumococcal population recovered 
from the removal of nearly 30% prevalent genotypes targeted by vaccine. This 
observation was elucidated by recent work by Corander and colleagues, who
investigated accessory gene frequencies across 4127 pneumococcal isolates from four
distinct geographic areas (Corander et al. 2017). They found that accessory genes 
had similar frequencies in the four populations despite significant differences in 
lineage composition and the timing of vaccine use. Through functional analysis of 
the accessory genes and population dynamic modeling, they proposed that the 
frequencies of accessory genes are shaped by negative frequency-dependent selec- 
tion (NFDS) through pathogen-pathogen, host-pathogen, and pathogen-environ- 
ment interactions. Classically defined, in an NFDS model the fitness of a phenotype 
depends on its frequency relative to other phenotypes in a population. The same 
NFDS model has been used to explain the diversity of protein antigens among 
pneumococci, which we briefly touched upon earlier in the chapter. In the case of
protein antigens, increasing host immunity toward an antigen drives diversification 
of the gene coding for the protein either through mutation or, most often,
recombination. The same dynamic can be observed with prophages and
restriction-modification systems that defend against infection. Ultimately, these
observations point to a central hypothesis for accessory genome variation: that
differences in gene content are linked to adaptation and niche specialization, but
that in the case of NFDS the
niche may be dynamically generated by fluctuating frequencies of loci in the 
pangenome. 
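
The return of gene frequencies toward their earlier values is what a simple
NFDS model predicts. The sketch below is a one-locus toy and not the
multilocus model of Corander et al. (2017): carriers of a locus have fitness
1 + s(e - f), where f is the current frequency of the locus, and the
equilibrium frequency e and selection strength s are assumed parameters chosen
for illustration.

```python
def nfds_trajectory(f0, equilibrium, strength, generations):
    """Frequency of a locus whose carriers gain fitness when rare and lose it when common."""
    freqs = [f0]
    f = f0
    for _ in range(generations):
        w_carrier = 1.0 + strength * (equilibrium - f)  # non-carriers have fitness 1
        mean_w = f * w_carrier + (1.0 - f)
        f = f * w_carrier / mean_w                      # standard selection update
        freqs.append(f)
    return freqs


# A perturbation (e.g., removal of carrier lineages) pushes the locus to 10%;
# an opposite perturbation pushes it to 80%. Both relax toward e = 0.4.
after_removal = nfds_trajectory(f0=0.10, equilibrium=0.40, strength=0.5, generations=200)
after_excess = nfds_trajectory(f0=0.80, equilibrium=0.40, strength=0.5, generations=200)
```

In this toy model the locus returns to the same frequency regardless of which
lineages carry it, which mirrors the observation that accessory gene
frequencies were similar across populations with different lineage
compositions.
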
4 The Ecological Significance of Pangenomes


The observation of pangenomes as a common feature of many bacteria raises the
question of what has selected for them. What are the ecological features that lead to the
pervasive association of a core, with a disseminated complement of many additional 
genes, some shared with other species? While some have clear selective conse- 
quences, most are obscure. The extent to which bacteria vary in gene content sets 
them apart from eukaryotes, and is just one of the reasons we cannot easily transfer 
population genetic concepts between the superkingdoms of life. One metaphor for 
bacteria and their varying genome content compares them to modern smartphones 
(Young 2016) in which the core genome is the operating system, the accessory 
genome is the apps downloaded to the phone, and the pangenome would be 
everything in the app store. In the following, we divide up the accessory genes 
that combine to make up a pangenome into various categories, not by function but by 
how they are distributed among lineages in the population. 

The perspective we take is of the bacterial genome as a transient construct. Loci 
can be added to it, and selected to become more common or indeed lost from the 
population, should they no longer be necessary. The pangenome for any sample is 
the totality of genes currently associated with its contents. This need not be a 
permanent or even especially long relationship. Consider a locally prominent pro- 
phage, which might not be present in the same population if you returned at a later 
date. Indeed we can imagine that given the many ways bacteria engage in HGT, a 
sample of sufficient size will contain many loci in a new genetic background that are 
yet to be lost (analogous to incomplete purifying selection (Rocha et al. 2006)). A
subset of the pangenome, expected to be rare in any reasonably large sample, is 
genes that are either infrequently obtained or actively selected against. In general, the 
extent of gene flow will be regulated by the genetic and ecological similarity of the 
bacteria and the compatibility of the genetic background to adapt to the acquisition 
of novel genes (Wiedenbeck and Cohan 2011). 

Moving to loci that are present at intermediate frequencies, say between 5% and 
95% of isolates, we can distinguish between loci that are restricted to a few lineages, 
or are widely disseminated but not fixed in any lineage. These suggest different 
evolutionary scenarios. Dealing with the latter first, a locus that is easy to obtain but
hard to hold onto suggests fluctuating selection. We see it more often than the genes
in the previous category, because it provides selective benefits. However, these are
not consistent benefits or we would expect the gene to rapidly become more common
and indeed part of the core. Examples of these include drug resistance genes in
lineages that lack compensatory mutations, and as such only experience a selective 
benefit in the presence of the drug (Blanquart et al. 2018; Cobey et al. 2017; Lehtinen 
et al. 2017). 
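
The categories used in this discussion can be made operational with a small
classifier. The 5% and 95% cutoffs follow the text; everything else is an
assumption made for illustration, including the category labels, the rule that
a locus counts as fixed in a lineage only when every sampled isolate of that
lineage carries it, and the toy population of four lineages with ten isolates
each.

```python
from collections import Counter


def classify_locus(carriers, lineage_of, low=0.05, high=0.95):
    """Classify a locus by overall frequency and by how it is spread across lineages."""
    freq = len(carriers) / len(lineage_of)
    if freq < low:
        return "rare"
    if freq > high:
        return "core-like"
    lineage_size = Counter(lineage_of.values())
    carriers_per_lineage = Counter(lineage_of[iso] for iso in carriers)
    fixed = [lin for lin, n in carriers_per_lineage.items() if n == lineage_size[lin]]
    if len(fixed) == len(carriers_per_lineage):
        return "lineage-restricted"       # fixed in every lineage that carries it
    if not fixed:
        return "disseminated, not fixed"  # scattered across lineages
    return "mixed"


# Hypothetical sample: isolate name -> lineage.
lineage_of = {f"L{l}_{i}": f"L{l}" for l in range(1, 5) for i in range(10)}

locus_x = {iso for iso, lin in lineage_of.items() if lin == "L1"}
locus_y = {"L1_0", "L2_0", "L2_1", "L3_0", "L4_0"}
locus_z = {"L1_0"}

labels = {
    "x": classify_locus(locus_x, lineage_of),
    "y": classify_locus(locus_y, lineage_of),
    "z": classify_locus(locus_z, lineage_of),
}
```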

In contrast, loci fixed in a lineage might represent the ecological “address” of 
those bacteria, a dimension of their niche. However, this need not be the case. 
Studies of populations of S. pneumoniae have shown that the accessory loci in this 
species are not widely disseminated, but are also rarely restricted to a single lineage 
and are instead shared among several, in different combinations (Croucher et al.
2014). It has been suggested that different combinations of accessory loci might be 
selected in different populations, depending on the overall frequencies of the indi- 
vidual genes, as a result of negative frequency-dependent selection (Corander et al. 
2017). At present, this remains a hypothesis without definitive proof. 

We must also recognize that a locus might have no wider ecological significance 
whatsoever. Toxin-antitoxin genes can drive their own acquisition and maintenance, 
to say nothing of the multitudes of transposable elements, prophage and the like 
(Wozniak and Waldor 2009). Bacterial genomes are characterized not only by their 
variable gene content, and the transience of the associations between loci (long for 
core genes, short for others) but by the divergent selective processes affecting them. 
In some cases (the core) these are aligned, while in others they are not. Population 
geneticists who study sexually reproducing eukaryotes are familiar with the notion 
that the selective interests of different loci in the same genome may differ. The 
shuffling of genetic information in each generation effectively uncouples the asso- 
ciation between all but the closest loci, but even the most frequently recombining 
bacteria (Arnold et al. 2018) do not approach the state of sexually reproducing 
eukaryotes. As a result, the overall fitness of a bacterial genome is the product of 
all the loci making it up. To preserve this overall fitness, it has been proposed that 
homologous recombination in bacteria is an adaptation to prevent the colonization of 
the genome with selfish genetic elements, by rapidly replacing them with the 
homologous region in the ancestral strain, which lacks the additional gene (although 
this does not explain the notable variation among bacteria in their recombination 
rates) (Croucher et al. 2016). One of the greatest challenges in providing a satisfying 
account of bacterial population genetics has been separating the patterns that are the 
result of selection, from those of linkage. 

The question of how the individual loci that make up the accessory or dispensable
part of the pangenome associate themselves with the lineages that are defined by the
core component has come under increasing scrutiny as the numbers of population
genome samples have increased. Population genetic models for the core genome 
specifically developed with bacteria in mind, and capable of handling the various 
amounts of homologous recombination, are not common. Rarer still are models that 
explicitly consider the gain and loss of genes from the accessory genome. Although 
gain and loss of loci is not unknown in eukaryotes, and has been implicated in some 
major adaptive events (McInerney 2017; Schönknecht et al. 2014), it is nowhere near
as extensive and does not have anything like the impact it does in bacteria. Accurate 
models for such processes are crucial to detect departures from neutrality, and 
several studies have actually found apparently neutral associations between elements 
of the pangenome and the core. However, there are reasons to think that the sequence
variation associated with the accessory genome may produce fundamentally differ- 
ent results from those in population genetics textbooks. For example, if the site 
frequency spectrum expected under neutral assumptions is extended to allow muta- 
tions in loci that can be gained or lost, systematic bias results (Baumdicker 2015; 
Baumdicker et al. 2012; Collins and Higgs 2012). 
Given that the accessory fraction of the pangenome is enriched for loci involved
in properties ranging from toxin production to restriction-modification systems and
surface antigens, to say nothing of drug resistance genes, it is hard to imagine that
it might fit a neutral model well; in several cases, though, it does (Baumdicker et
al. 2012; Marttinen and Hanage 2017). This result is hard to accept, and so it
should be, given all we know about the power of selection and the size of bacterial
populations. However, it should be appreciated that a multitude of selective
scenarios can produce a
signal that is hard or impossible to distinguish from neutrality. Study of other metrics 
may be required to unveil the underlying processes. For instance, the rates at
which diverging strains of pneumococcus acquired or lost genes were found to be
indistinguishable from neutrality and even to yield good estimates of the population 
mutation and recombination rates (Marttinen and Hanage 2017). Yet later analysis of 
the same population, alongside others, was interpreted as strong evidence for 
negative frequency-dependent selection on the accessory fraction of the genome 
(see above). What is going on? 

A possible explanation lies in the central limit theorem. If an outcome is deter- 
mined by many independent random variables, each with finite variance, then we 
expect the result of adding them all together to be a normal distribution. In other 
words, if the fitness of a strain is the consequence of many independent factors, we 
might find it appears neutral—the chances of any individual getting into the next 
generation could be normally distributed around 50:50. This result has been the 
source of substantial interest in ecology, given that it can be used to show that 
species abundance distributions (SADs), a common metric for summarizing
ecological diversity (McGill et al. 2007), can appear neutral while actually being the
result of many non-neutral processes. In the case of bacteria, the fitness effects of
genes on the same mobile element may not be independent; however, the effects of
multiple mobile elements may similarly approximate to an overall strain fitness not
distinguishable from neutrality. Other models from community ecology may be 
useful in determining the contents of genomes, as well as ecosystems. 
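
The central limit argument can be illustrated with a small simulation. The sketch below is illustrative only: the number of elements per strain, the choice of a strongly skewed (shifted exponential) distribution for individual fitness effects, and all function names are assumptions made for the example and are not taken from the studies cited. It shows that the sum of many independent skewed effects is far more symmetric than any single effect, which is the sense in which a strain's overall fitness can look like an unstructured, bell-shaped quantity.

```python
import random
import statistics


def strain_fitness(n_elements, rng):
    """Sum of n independent fitness effects, each skewed and with mean zero."""
    # expovariate(1.0) has mean 1 and is right-skewed; subtracting 1 centres it.
    return sum(rng.expovariate(1.0) - 1.0 for _ in range(n_elements))


def skewness(values):
    """Sample skewness: close to zero for a symmetric, normal-like distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(1)  # fixed seed so that the run is reproducible
one_element = [strain_fitness(1, rng) for _ in range(2000)]
many_elements = [strain_fitness(100, rng) for _ in range(2000)]

print("skewness with 1 element:   ", round(skewness(one_element), 2))
print("skewness with 100 elements:", round(skewness(many_elements), 2))
```

As the text notes, the argument relies on independence, so effects of genes carried on the same mobile element would need to be grouped into a single term before summing.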

Nevertheless, the current consensus in the field is that gene variation directly 
reflects the ecological niche occupied by the bacteria (Sheppard et al. 2018) and the 
response to local selective pressures (Cordero and Polz 2014). This may involve the 
acquisition of antibiotic-resistance genes, as described above, metabolic genes needed
to exploit a novel energy source, bacteriocins for microbial warfare, or phage and
phage-defense genes involved in predator-prey "rock-paper-scissors" dynamics, as
so eloquently described by Corander and colleagues (Arnold et al. 2018). Further, it 
is suspected that rapid acquisition and dissemination of genes most often occurs as 
bacterial clones adapt to a novel niche previously occupied by another species (Polz 
et al. 2013; Popa et al. 2011; Smillie et al. 2011; Vos et al. 2015). An example of this 
would be the acquisition of an IncA/C plasmid by Vibrio cholerae introduced to Haiti, a
country previously devoid of epidemic cholera for at least 100 years (Carraro et al.
2016). Another is the post-vaccine population of S. pneumoniae in the USA, which
experienced a significant population shift after the 7-valent pneumococcal conjugate
vaccine removed approximately 30% of the pre-vaccine population. Niches themselves
are not explicitly segregated, and therefore one does not have to be vacated to
then be exploited by a newcomer. Gene flow may occur between sympatric lineages;
i.e., habitat borders are not defined by walls or other barriers, and recombination can 
occur among lineages of a species where habitat space is not clearly demarcated 
(Marttinen and Hanage 2017). This model explains lineage divergence and popula- 
tion structure among several species, and is important because it highlights that a 
species requires not only the ability to acquire genes but also the opportunity to do 
so. Interestingly, it has also been suggested that once a competent species encounters 
a new niche, it can give rise to noncompetent lineages, providing an advantage when 
adaptation through gene acquisition is not required and may, in fact, be deleterious 
(Jorth and Whiteley 2012). 

The acquisition of genes is not always beneficial and may, in fact, be deleterious 
(Vos et al. 2015). Indeed, for every successful lineage that is observed, there are 
likely several "failed" ecological experiments. Since there is not a clear delineation 
between the fitness gain and costs of gene acquisition, it may be an oversimplifica- 
tion of the dynamics to ascribe a net-positive or net-negative effect of gene gain and 
loss. The truth, of course, is somewhere in between and likely varies to different 
degrees between species. To offset fitness costs and compensate for the acquisition 
of mobile elements, mutations may arise in core loci and form epistatic relationships 
with the acquired gene. This has been suggested, for example, in E. coli, where 
nucleotide substitutions in regulatory genes were found to be associated with the 
acquisition and maintenance of accessory genes (McNally et al. 2016). This dynamic 
is further supported, in part, by recent findings of epistatic interactions across 
genome-wide loci among multiple bacterial species (Arnold et al. 2018; Skwark 
et al. 2017). 

There are examples where ecological niches are clearly defined among species 
and others where the relationships between habitat and organism are obscured. In the 
E. coli study (McNally et al. 2016), ecological adaptation and niche segregation were 
not observed among isolates collected from humans and animals, while in other 
species such as Campylobacter, this is commonly observed (Sheppard et al. 2011). 
Methods to investigate gene flow and selection in the context of adaptation and 
ecology are continually being refined. In some instances, identifying the appropriate 
system to test ecological hypotheses is the limiting step. An intriguing approach to 
understanding these associations is not to identify niches, the organisms that inhabit 
them, and then attempt to resolve the genes associated with adaptation, but instead 
first assess gene flow and then make predictions about ecology. So-called "reverse
ecology", proposed by Shapiro and Polz, seeks to investigate habitat specificity by
assessing gene flow and gene-specific sweeps, and has been used to predict ecological
differentiation of Vibrio spp. in aquatic environments (Hunt et al. 2008; Shapiro
et al. 2012; Shapiro and Polz 2014). These studies are an example of applying a
fresh perspective to an appropriate model system to understand bacterial ecology.

Taken together, the accumulation of population samples that have been analyzed 
with modern genomic methods has greatly improved our understanding of the 
pangenome, and its ecological significance. The totality of loci in a sample includes 
the essential core, together with a set of accessory loci that have a range of ecological 
and evolutionary significance: from functional genes with direct relevance to niche,
such as those described in the reverse ecology approach of Shapiro and Polz, to more
selfish elements such as toxin-antitoxin systems. One feature of the current landscape
of bacterial genomics that is not often noticed is that, for all the references in
the literature to "Whole Genome Sequencing", few studies actually determine the
whole, i.e., finished, genome including all plasmids. Our current understanding is
overwhelmingly based on high-quality draft, not finished, genomes. The emergence 
of long-read technologies is changing this, and as they improve and become more 
economical (together with more methods for making hybrid assemblies from short- 
and long-read data) we may find that our current understanding underestimates the 
actual quantities of sequence variation in bacteria and that there are short regions 
under strong selection that accumulate rapid change and are hence hard to assemble 
from short-read data. Adding these is just one of the exciting directions for research 
over the next few years, which is sure to improve our understanding of pangenomes 
and their significance far beyond our current knowledge.


Abstract The comparison of multiple genome sequences sampled from a bacterial
population reveals considerable diversity in both the core and the accessory parts of 
the pangenome. This diversity can be analysed in terms of microevolutionary events 
that took place since the genomes shared a common ancestor, especially deletion, 
duplication, and recombination. We review the basic modelling ingredients used 
implicitly or explicitly when performing such a pangenome analysis. In particular, 
we describe a basic neutral phylogenetic framework of bacterial pangenome micro- 
evolution, which is not incompatible with evaluating the role of natural selection. 
We survey the different ways in which pangenome data is summarised in order 
to be included in microevolutionary models, as well as the main methodological 
approaches that have been proposed to reconstruct pangenome microevolutionary 
history.


1 Atomic Events in Bacterial Microevolution


Bacterial microevolution is the study of the evolutionary forces that shape the 
genetic diversity of a natural population of bacteria. This evolutionary process 
takes place as a result of the genetic changes happening within each of the genomes 
of the bacterial cells constituting the population. Over time, these changes are 
amplified or weakened by the effects of both genetic drift and natural selection.

Genetic drift represents the evolution caused by the death and birth of cells in the
bacterial population, and it acts at random on all genetic variants (Charlesworth
2009). The effects of genetic drift are stronger when the population size is small, and
so it could be thought, given the large number of cells in bacterial populations, that
genetic drift would be weak. However, bacterial populations sometimes go through
transient bottlenecks during which genetic drift has a large effect, for example during
transmission of pathogens from one host to another (Didelot et al. 2016). It is also
believed that the strong structure of bacterial habitats, sometimes at the microscale,
can lead to much smaller effective population sizes than intuition suggests (Vos et al.
2013). Natural selection, on the other hand, acts in a nonrandom fashion, amplifying
some variations and suppressing others, and is a very potent evolutionary force in 
shaping the diversity of bacterial species (Petersen et al. 2007; Buckee et al. 2008; 
Pepperell et al. 2013). 

The genetic changes occurring in a single bacterial cell can be classified into
mutation and recombination events, and the events of interest differ depending on whether the
focus is on the core genome (the regions shared by all genomes in the population) or 
the accessory genome (the regions that are found in some but not all of the genomes). 
As far as the core genome is concerned, the main type of mutation is the point 
mutation, whereby a single nucleotide is replaced, and the main type of recombina- 
tion is called homologous recombination, in which a relatively short fragment of the 
genome is replaced with a homologous fragment coming from another bacterial cell 
(Didelot and Maiden 2010). There are three biological mechanisms that can lead to 
homologous recombination, namely conjugation (where two bacterial cells come in 
contact so that DNA can be transmitted from donor to recipient), transduction (where 
a phage acts as vector from donor to recipient) and transformation (where naked 
DNA is picked up by the recipient from the environment, possibly following the 
death of the donor cell) (Thomas and Nielsen 2005). However, since their outcomes are
hard to distinguish, this diversity of mechanisms is usually ignored in evolutionary
models of homologous recombination. 

Point mutation and homologous recombination events clearly act on the evolution 
of the accessory genome in the same way as they do for the core genome. However, 
they do not change the genetic content of core genomes. There are two types of 
endogenous mutations that can alter the genetic content of a genome, duplication and 
deletion, and they can be thought of as opposite forces, with the former increasing 
the number of copies of a gene by one and the latter decreasing it by one. Finally, the 
accessory genome is also subject to non-homologous recombination, where a bac- 
terial cell imports a DNA fragment from another cell and inserts it in its genome, 
without overwriting a previously existing homologous fragment (Ochman et al. 
2000). Non-homologous recombination is often called lateral gene transfer or 
horizontal gene transfer, and in this chapter we will be using these three terms 
interchangeably. It should be noted, however, that this terminology is not always 
consistently used in the literature, with some authors using the term horizontal gene 
transfer to refer to both homologous and non-homologous recombination. 
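
The three events that change gene content can be written down as operations on a table of gene copy numbers. The following is a minimal sketch under stated assumptions: a genome is reduced to a mapping from gene name to copy number, gene order and sequence are ignored, and the function and gene names are hypothetical. Point mutation and homologous recombination are deliberately absent, since they do not alter gene content.

```python
from collections import Counter


def duplicate(genome, gene):
    """Duplication: one more copy of a gene that is already present."""
    if genome[gene] == 0:
        raise ValueError(f"cannot duplicate absent gene {gene!r}")
    updated = genome.copy()
    updated[gene] += 1
    return updated


def delete(genome, gene):
    """Deletion: one fewer copy; the gene is lost when no copy remains."""
    if genome[gene] == 0:
        raise ValueError(f"cannot delete absent gene {gene!r}")
    updated = genome.copy()
    updated[gene] -= 1
    if updated[gene] == 0:
        del updated[gene]
    return updated


def transfer(recipient, donor, gene):
    """Non-homologous recombination: import one copy of a gene from a donor."""
    if donor[gene] == 0:
        raise ValueError(f"donor does not carry gene {gene!r}")
    updated = recipient.copy()
    updated[gene] += 1  # added without overwriting any existing copy
    return updated


# Hypothetical genomes used only for illustration.
recipient = Counter({"geneA": 1, "geneB": 1})
donor = Counter({"geneA": 1, "geneC": 1})

after = transfer(recipient, donor, "geneC")  # gains geneC
after = duplicate(after, "geneA")            # geneA now has two copies
after = delete(after, "geneB")               # geneB is lost
print(dict(after))
```

Each function returns a new table rather than modifying its input, so the ancestral and derived gene contents can be compared directly.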

The three biological mechanisms mentioned above for homologous recombination
(conjugation, transduction, and transformation) can also lead to non-homologous
recombination and, once again, it is helpful when studying the bacterial
microevolution of the accessory genome to set aside the mechanism at play. Likewise,
genetic duplication and deletion can have multiple causes that we will not
explore. It should in fact be noted that even though it is useful to present and study 
them as separate, the atomic evolutionary events briefly described above are not 
biologically independent (Lawrence 1999; Everitt et al. 2014; Oliveira et al. 2017). 
For example, a single event of recombination could involve the replacement of some 
genes (homologous recombination), the insertion of new genes (non-homologous 
recombination) and the loss of some other genes (genetic deletion). 

Furthermore, non-homologous recombination can sometimes be duplicative if the 
newly imported material is homologous to a sequence found somewhere else in the 
genome. In this case, the number of copies of the genes concerned is increased by 
one, as in a duplication event. The evolutionary distance between donor and 
recipient of such a non-homologous recombination event is then a crucial factor: if 
this distance is small the effect is similar to a duplication event, which can be seen as 
a transfer event where recipient and donor are the same organism. If distance 
between organisms is high, then the difference between the newly imported copy 
and the copy already present will likely be high too, providing a clear sign that 
duplication was not involved. This situation is analogous to the detection of homol- 
ogous recombination in the core genome, where events from a closely related source 
do not leave a trace, or perhaps involve just a single substitution in which case they 
are indistinguishable from point mutation (Didelot et al. 2010).


2 Neutral Phylogenetic Framework of Bacterial Pangenome 
Microevolution 


2.1 Challenges with a Comprehensive Model 


The microevolutionary events that act on the bacterial pangenome, as briefly 
described above, can be combined into an evolutionary model of how the 
pangenome evolves over time. Let us consider a comprehensive model, which 
would account for the whole population of bacterial cells, including the fact that 
cells die and reproduce over time (so that genetic drift is included) and that various 
selective pressures are exerted. In this model, the genome of each cell is affected by 
various mutation and recombination events, all of which happen at certain rates
over time for each cell. All the rates involved in this model (birth and death of 
individuals, selection for specific variants and various evolutionary events) would 
not be assumed to be constant, but would be allowed to vary over time. This model 
falls in the class of forward-in-time models, due to the fact that it considers evolution 
as it unfolds over time, and famous examples of such models in the general 
population genetics literature are the Wright-Fisher model (Fisher 1931; Wright 
1931) and the Moran model (Moran 1958). Figure 1 illustrates such a forward-in- 
time model of pangenome evolution.

Fig. 1 Illustration of the forward-in-time evolution of a bacterial population and its pangenome. At
each time step, an individual is removed and another gives birth as in a standard Moran model. 
Furthermore, at each time step the accessory genome of each individual may evolve via deletion 
(orange cross), duplication (red square) and recombination (orange arrow) 
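
A toy version of the process in Fig. 1 can be sketched in a few lines. This is not the model of any cited study: it is a minimal neutral simulation in which the population size, the per-step event probabilities (`p_del`, `p_dup`, `p_rec`) and the gene names are all arbitrary assumptions, and in which each accessory genome is reduced to a table of gene copy numbers.

```python
import random
from collections import Counter


def moran_step(population, rng, p_del=0.01, p_dup=0.01, p_rec=0.01):
    """One time step of a neutral Moran model with accessory-genome events.

    population is a list of Counters mapping gene name -> copy number.
    The three probabilities are illustrative per-individual, per-step values.
    """
    n = len(population)
    # Birth and death: a random parent replaces a random individual (drift).
    parent, dead = rng.randrange(n), rng.randrange(n)
    population[dead] = population[parent].copy()

    for genome in population:
        if genome and rng.random() < p_del:      # deletion of one gene copy
            gene = rng.choice(sorted(genome))
            genome[gene] -= 1
            if genome[gene] == 0:
                del genome[gene]
        if genome and rng.random() < p_dup:      # duplication of one gene copy
            genome[rng.choice(sorted(genome))] += 1
        if rng.random() < p_rec:                 # import from a random donor
            donor = population[rng.randrange(n)]
            if donor:
                genome[rng.choice(sorted(donor))] += 1
    return population


rng = random.Random(42)
population = [Counter({f"gene{i % 3}": 1}) for i in range(20)]
for _ in range(500):
    moran_step(population, rng)

pangenome = set().union(*population)
print(len(population), "individuals;", len(pangenome), "distinct accessory genes")
```

Even this toy version stores and updates every individual at every step, which hints at why a realistic forward-in-time model quickly becomes impractical.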


The idea of this comprehensive model is to replicate exactly the processes that we 
know occur in nature, so that it is of the highest possible realism. However, a 
comprehensive model would also integrate the diversification process of the whole 
community of microorganisms found at a given spot, with the impact of their biotic
interactions and genetic exchange, but most importantly, of the competitive process
leading to natural selection of the fittest. Such a level of description of natural
processes would render the model impossible to use, and that is why it is not
found in the literature. It is instructive, though, to ask ourselves why this model is
unusable, as this will guide us towards more practical models that feature some of 
these ideal properties. 

The first problem with this comprehensive model is computational: it would 
require very large amounts of computer memory to store the state of a population 
at a single time point, and much more to track its evolution over time, and an
equally impossible amount of computer power to consider the evolutionary events
happening to all members of the population. But even more importantly, there is
a statistical problem with the comprehensive model, in the sense that there are too
many unknown quantities involved, for which we would not be able to take even a
very rough guess at what their value might be. Therefore, even if the computational
problems could be overcome, and an analysis conducted under the comprehensive
model, the results would be worthless since the quantities to be estimated would 
be unidentifiable. Simplifications will therefore have to be made to reach a model 
that has practical use, with the best model being not the most comprehensive one, but 
the one that achieves the best trade-off between biological realism on one hand, and 
computational and statistical considerations on the other hand. 

Beyond the degree of complexity of a model and the search for a trade-off 
between computational efficiency and model realism, models may rely on different 
conceptual formalisations of bacterial genomes and their evolutionary process. These
different concepts will generate different approaches and methods that are in general
complementary. We will thus present different elements of phylogenetic models of
pangenome evolution, whose flavours may be combined to provide a variety of
practical models. 


2.2 Analysing Selection Based on Neutral Models 


Perhaps the greatest challenge posed by the comprehensive model above would be 
its attempt at encompassing the role of natural selection. As previously mentioned, it 
is clear that natural selection plays a crucial role in shaping the microevolution of the 
pangenomes of bacterial populations, but the effect of this force is different for all 
genes or nucleotide sites and their allelic variants, may vary significantly over time,
and may be different for different segments of the population, for example if some
lineages are adapted to a certain environment. Such adaptation of a lineage will
involve many traits distributed in the pangenome of that population, and new
mutations arising in this background might interact with it; this leads to complex
epistatic (i.e. non-additive fitness) interactions between genomic traits, affecting the 
probability of selecting new genetic variants in one or another genomic back- 
ground—a process that could add infinite degrees of complexity to the exhaustive 
model. Model design can, therefore, be greatly simplified by considering no effect of 
selection, or in other words neutrality of evolution. 

Even if the role of selection is not explicitly included in a model, it does not mean 
that analyses based on this model are completely uninformative about selection. 
Neutral evolutionary models provide a framework to search for evidence of natural 
selection. This can be achieved formally by contrasting observed patterns in com- 
pared genomic data to expectations under neutral models. Another approach to 
detect selection is to fit a neutral model to genomic data, having heterogeneous 
parameters to describe the evolution process of each species lineage and/or gene 
family; outlier species or genes with 'abnormally' high or low parameters can 
provide a clue to non-neutral processes taking place. Similarly, the identification
of historical changes of processes (e.g. acceleration or slowing down of diversification
rates) in the scenario of pangenome evolution can provide strong clues of
selection affecting the species lineage or gene under focus (Boussau et al. 2004; 
David and Alm 2011; Lassalle et al. 2017). 

This approach is similar to the way that the role of selection is being investigated 
in the core genome. In this more frequently explored setting, a typical pipeline 
(Hedge and Wilson 2016) involves reconstructing a phylogenetic tree, classifying 
substitutions in terms of whether they are synonymous or not and estimating 
evolutionary rates so that selection tests can be applied based on variations in the 
rates of synonymous and non-synonymous substitutions along the genome (Wilson 
and McVean 2006; Castillo-Ramirez et al. 2011) or between populations (McDonald 
and Kreitman 1991; Vos 2011). In this popular approach, the evolutionary models 
used to build the phylogenetic tree and reconstruct substitution events are purely 
neutral, but still lead to invaluable insights into the natural selection process. 
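
As a worked illustration of a test that contrasts within-population and between-population substitutions, the sketch below computes the neutrality index, a commonly used summary of the McDonald-Kreitman table that the text does not itself discuss. The counts are hypothetical and chosen only to show the arithmetic; the function name is likewise an assumption.

```python
def neutrality_index(dn, ds, pn, ps):
    """Neutrality index NI = (Pn / Ps) / (Dn / Ds).

    dn, ds: non-synonymous and synonymous fixed differences between populations.
    pn, ps: non-synonymous and synonymous polymorphisms within a population.
    Under neutrality the two ratios are expected to be equal, so NI is near 1.
    """
    if ps == 0 or ds == 0 or dn == 0:
        raise ValueError("NI is undefined when Ps, Ds or Dn is zero")
    return (pn / ps) / (dn / ds)


# Hypothetical counts for a single gene.
ni = neutrality_index(dn=7, ds=17, pn=2, ps=42)
print(f"NI = {ni:.3f}")
```

A value well below 1 indicates an excess of non-synonymous fixed differences, which is usually read as a signal of positive selection, whereas a value well above 1 indicates an excess of non-synonymous polymorphism. In practice the index would be accompanied by a significance test on the underlying 2 x 2 table.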


2.3 Phylogenetic Approach 


A neutral version of the comprehensive model described above would still be 
impossible to use in practice. A major difficulty is that it considers the evolution 
of the population forward-in-time, so that every single cell in the population has to be 
included. However, any dataset we may have available for analysis will only include 
a small fraction of the population, sampled typically at a single time point (or a few 
recent time points in the best-case scenario). Moreover, considering the evolution of
the whole population over time can appear wasteful, since most cells in the past 
would not have had any descendants surviving in the present-time sample. A much 
more tractable approach is therefore to only consider the genealogical process of the 
sampled genomes, which is a backward-in-time process. Under relatively mild 
assumptions, and without introducing too much approximation, this genealogical 
process can be described without reference to the whole forward-in-time process. In 
particular, the coalescent model (Kingman 1982) describes the genealogical process 
of a population following either the Wright-Fisher or the Moran model of forward- 
in-time evolution. Extensions of the basic coalescent model have been derived to 
deal with fluctuating population size (Griffiths and Tavare 1994), homologous 
recombination (Griffiths and Marjoram 1997), which for bacteria is analogous to 
gene conversion (Wiuf and Hein 2000), and many other forms of relaxation of the 
assumptions ruling the evolutionary process (Donnelly and Tavare 1995; Nordborg 
2001; Rosenberg and Nordborg 2002). 
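
The backward-in-time view is simple enough to simulate directly. The sketch below draws the waiting times of the basic coalescent for a sample of genomes, assuming a constant population size, no recombination, and time measured in coalescent units; the function name and the sample size are illustrative choices. While k lineages remain, the time to the next coalescence is exponentially distributed with rate k(k-1)/2, so the expected time to the most recent common ancestor of n genomes is 2(1 - 1/n).

```python
import random


def coalescent_times(n_samples, rng):
    """Waiting times between successive coalescence events for n sampled genomes.

    Time is in coalescent units. While k lineages remain, the waiting time to
    the next coalescence is exponential with rate k * (k - 1) / 2.
    """
    return [rng.expovariate(k * (k - 1) / 2) for k in range(n_samples, 1, -1)]


rng = random.Random(7)  # fixed seed so that the run is reproducible
n = 10
tmrca = [sum(coalescent_times(n, rng)) for _ in range(20000)]
mean_tmrca = sum(tmrca) / len(tmrca)
expected = 2 * (1 - 1 / n)

print(f"simulated mean TMRCA: {mean_tmrca:.3f}")
print(f"theoretical value:    {expected:.3f}")
```

Only the n sampled lineages are tracked, never the whole population, which is what makes this approach tractable compared with the forward-in-time model.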

Considering this genealogical process, and the ability to reconstruct it with 
relatively high accuracy from genome sequences, is pivotal to obtaining a usable
model of pangenome evolution. A simple approach is to focus on core genome 
elements and apply a standard phylogenetic method typically based on maximum 
likelihood or Bayesian inference under an evolutionary model of neutral point 
mutations (Yang and Rannala 2012). Bacteria reproduce clonally and most species
recombine relatively rarely (Vos and Didelot 2009; Yang et al. 2018), so that this
simple approach can often be sufficient for our purposes. Phylogenetic methods have 
also been developed that can account for the effect of homologous recombination 
while still reconstructing a single tree (Didelot and Wilson 2015; Croucher et al. 
2015). Methods that attempt to reconstruct a graph of ancestry rather than a single 
tree are superior in principle, but rarely used in practice due to their high computa- 
tional cost (Didelot et al. 2010; Vaughan et al. 2017). 

In the context of a phylogenetic tree reconstructed from the core genome, we can 
consider the events of duplication, deletion and non-homologous recombination that 
shape the accessory genome. These events happen on the branches of the phylogeny 
at certain rates that may vary over time and lineages. Notwithstanding such 
remaining complexities, a phylogenetic model of bacterial pangenome microevolu- 
tion represents a practical approach relative to the comprehensive forward-in-time 
model. The events that affect the accessory genome are relatively rare, which results 
in a strong phylogenetic inertia of genome gene contents, i.e. a large correlation 
between gene contents and core genome-based diversity (Konstantinidis et al. 2006; 
Kislyuk et al. 2011). Ignoring this effect would lead to strong misinterpretation of 
gene distribution patterns, especially in the case of a diversity bias in genome sampling,
e.g. when surveying a pathogen epidemic where clusters of closely related strains
occur. Modelling pangenome evolution within a phylogenetic framework where
evolution of the gene content takes place along the genealogical tree avoids such
pitfalls, in the same way as a phylogenetic framework prevents false conclusions from
being reached when performing bacterial genome-wide association mapping (Collins and
Didelot 2018). 
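
To make the idea of events happening along branches at certain rates concrete, consider the simplest possible case: a single gene that is either present or absent, gained at rate `gain_rate` and lost at rate `loss_rate`, both assumed constant along a branch. This two-state continuous-time Markov chain is a deliberate simplification (it ignores copy number and rate variation across time and lineages), and the names and numerical values below are assumptions for illustration.

```python
import math


def presence_probability(present_at_start, gain_rate, loss_rate, branch_length):
    """Probability that a gene is present at the end of a branch.

    Two-state continuous-time Markov chain with constant gain and loss rates.
    The probability relaxes exponentially from the starting state towards the
    stationary frequency gain_rate / (gain_rate + loss_rate).
    """
    total = gain_rate + loss_rate
    stationary = gain_rate / total
    decay = math.exp(-total * branch_length)
    start = 1.0 if present_at_start else 0.0
    return stationary + (start - stationary) * decay


# Illustrative rates: loss is four times faster than gain.
for t in (0.0, 0.5, 2.0, 50.0):
    p_kept = presence_probability(True, 0.1, 0.4, t)
    p_gained = presence_probability(False, 0.1, 0.4, t)
    print(f"t={t:5.1f}  P(present | present)={p_kept:.3f}  P(present | absent)={p_gained:.3f}")
```

On short branches the gene content of a descendant closely mirrors that of its ancestor, which is the phylogenetic inertia described above; only on long branches does presence approach the stationary frequency regardless of the starting state.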


3 Description of Pangenome Data for Inclusion 
in Microevolutionary Models 


3.1 Units of Pangenome Evolution 


In order to describe further the existing models of pangenome microevolution, it is 
necessary to consider the unit in which the pangenome is being described. Figure 2 
illustrates the different approaches that have been used for that purpose. The ideal 
starting point would be a complete sequence of each genome of interest, but this is 
rarely available due to repeat regions in the genomes that obscure the exact ordering 
of sequences along the genomes, at least based on short read sequencing. For that 
reason, the most frequently used data is a de novo assembly of each genome, which 
can be performed, for example using Velvet (Zerbino and Birney 2008) or SPAdes 
(Bankevich et al. 2012). This results in a set of genomic regions called contigs, which
occur in an unknown order either on chromosomes or on plasmids. 

A first approach considers a genome alignment, where every part of a genome is 
assigned to a syntenic block, i.e. a set of genome sequence segments that are all
homologous and can be aligned (Fig. 2b, c). These sequence segments can have
boundaries falling anywhere in the genome, notably between or within protein-coding
sequences, and they often span many genes. While this is a flexible view of
pangenome evolution that is probably the most realistic (evolving genomes ignore
human annotation of functional elements), it may be cumbersome to implement
with a growing number of compared genomes. Indeed, every genome added to the
dataset may result in the breakage of a syntenic block into several parts due to
insertions, deletions, or rearrangements in one of the homologous genome segments.
Homologous genome segments can themselves be difficult to align at the nucleotide 
level when they include fast-evolving genome regions. For these reasons, even the 
best software for this task such as progressiveMauve (Darling et al. 2010) or 
MUGSY (Angiuoli and Salzberg 2011) can only deal with between 10 and 
100 genomes, depending on how diverse they are. This first alignment approach 
works best on the well-conserved parts of pangenomes, i.e. the core genomes and 
possibly large conserved accessory regions of the genomes. This partial sampling of 
genome sequences is practical because it allows the homology between genomes
to be represented as a concatenated alignment of all these syntenic blocks, which amounts to
a representative map of the genome. Alternatively, a representative whole-genome 
alignment can be obtained by mapping all homologous sequences in compared 
genomes to the genome of a reference isolate, using, for example MUMmer 
(Kurtz et al. 2004). This can result in reference-biased representation, which may 
be avoided by restricting the alignment to the core genome. 

A second approach focuses on genes, or more specifically on families of homol- 
ogous genes. These are usually defined based on sequence similarity and restricted to 
protein-coding sequences, even though the approach can be applied to conserved intergenic
sequences as well (Fig. 2e). In this representation, rather than a reference whole-genome
map, we consider independent gene families, whose members need not be
localised in a genome. The diversity of the gene family can conveniently be
represented with a phylogenetic tree based on all nucleotide positions in the aligned 
genes, which allows the estimation of statistical support (Fig. 2f). This information 
can, in turn, be used to inform the ancestral reconstruction of genome gene content 
(as discussed below). There are several ways in which this gene family content 
identification can be performed. If a representative from each family is known in 
advance, similarity search tools like BLAST (Altschul et al. 1997) can be used to 
search for them in each genome; BIGSdb, for example, automates this process
(Jolley and Maiden 2010). Alternatively, each de novo assembled genome can be 
annotated separately, using, for example Prokka (Seemann 2014) or RAST (Aziz 
et al. 2008). Homologs can then be identified by using a combination of similarity 
search between genes from different genomes (e.g. with BLAST) and similarity 
network analysis, as implemented, for example in the software OrthoMCL (Li et al. 
2003), with integrated pipelines implemented in software like Roary (Page et al. 
2015) and MMseqs (Steinegger and Söding 2017).
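
Once genes have been assigned to families, the gene-centred description of a pangenome reduces to a presence/absence matrix. The sketch below assumes that the clustering step has already been done and that its result is available as a mapping from genome name to the set of gene family identifiers it contains. This input format, the function name and the identifiers are hypothetical and do not correspond to the output of any of the tools named above.

```python
def split_pangenome(families_by_genome):
    """Build a presence/absence matrix and separate core from accessory families.

    families_by_genome maps each genome name to the set of gene family
    identifiers found in that genome (a hypothetical input format).
    """
    genomes = sorted(families_by_genome)
    pangenome = sorted(set().union(*families_by_genome.values()))
    # One row per family, one column per genome: 1 if present, 0 if absent.
    matrix = {
        family: [int(family in families_by_genome[g]) for g in genomes]
        for family in pangenome
    }
    core = {family for family, row in matrix.items() if all(row)}
    accessory = set(pangenome) - core
    return genomes, matrix, core, accessory


# Hypothetical gene family content of three isolates.
families_by_genome = {
    "isolate1": {"famA", "famB", "famC"},
    "isolate2": {"famA", "famB"},
    "isolate3": {"famA", "famC", "famD"},
}
genomes, matrix, core, accessory = split_pangenome(families_by_genome)

print("genomes:  ", genomes)
for family, row in matrix.items():
    print(family, row)
print("core:     ", sorted(core))
print("accessory:", sorted(accessory))
```

Here the core is defined strictly, as families present in every genome; with draft assemblies a relaxed threshold is sometimes preferred, which would replace the `all(row)` condition.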

A consequence of this distinction is the way genetic exchange between genomes 
is considered. In the nucleotide-centred vision, genetic exchange will result either in 
the replacement of a region (homologous recombination) or the insertion of a
sequence at a defined position in the genome map (non-homologous recombination)
(Fig. 2c). Similarly, a genetic duplication event consists of copying a segment
of genome sequence and inserting it next to its template (tandem duplication) or 
away from it. Homologous recombination events can be evidenced based on a scan 
of the genome map, looking for increased or decreased sequence similarity 
(or phylogenetic relatedness) between compared genomes along the genome map 
(Didelot et al. 2007). Non-homologous recombination and duplication events consist 
of insertion events and are simply evidenced by some region being only represented 
in some genomes in the alignment—the others featuring a large ‘gap’, or long string 
of missing characters. Distinguishing non-homologous recombination from dupli- 
cation events can be tricky: even comparing the inserted segment to the rest of the 
genome and finding a similar region is not conclusive that it would be the duplication 
template (or copy) of the studied region. Such a pattern could also result from a 
recombination with a related organism leading to the insertion of genetic material 
that had homologous counterpart already residing in the recipient genome. Not 
finding a similar region is not conclusive of the insertion resulting from a 
non-homologous recombination event either, as an ancient duplication followed by 
a loss (or deletion) in the compared lineage may result in the same pattern. The 
answer to this conundrum is to model the possible sequences of events, or 
scenarios, and to determine the most likely one based on patterns of sequence divergence. 
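
The scan for locally increased or decreased similarity along the genome map can be illustrated with a minimal sketch. It assumes a pairwise alignment given as two equal-length strings with `-` for gaps; the function name `window_identity` and the window and step sizes are illustrative choices, not taken from any of the cited methods, which rely on explicit statistical models rather than raw identity.

```python
def window_identity(seq_a, seq_b, window=4, step=2):
    """Identity of two aligned sequences in sliding windows.

    Returns a list of (window_start, identity) pairs. Columns where either
    sequence has a gap ('-') are ignored; a window made only of gapped
    columns gets an identity of None.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for start in range(0, len(seq_a) - window + 1, step):
        columns = [
            (a, b)
            for a, b in zip(seq_a[start:start + window], seq_b[start:start + window])
            if a != "-" and b != "-"
        ]
        if columns:
            identity = sum(a == b for a, b in columns) / len(columns)
        else:
            identity = None
        profile.append((start, identity))
    return profile


# Toy alignment: identical first half, fully divergent second half.
print(window_identity("AAAACCCC", "AAAAGGGG", window=4, step=4))
```

A sharp change in identity between neighbouring windows is only a hint that a region may have been replaced; windows covered by long gaps correspond to the insertion-like patterns discussed above.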

In the second approach centred on genes, the exchange of genetic material is 
made most evident in the phylogeny of genes, or gene tree, because the gene from 
the recipient will be more closely related to genes from the donor than to genes from 
closely related species. In this context, the event is rather called horizontal gene 
transfer, in opposition to vertical evolution, which would have resulted in the 
‘normal’ clustering of genes from closely related species. Again, this representation 
ignores the locus where genes sit, and it is therefore not straightforward to know 
from the gene tree whether the horizontal gene transfer event resulted in the 
replacement of a resident sequence or in the addition of a new one. 

There are also other evolving units that can be considered as the basis of 
pangenome microevolution modelling, including conserved protein domains or 
short sequences of a constant length, which are also known as words, features, or 
k-mers (Sims and Kim 2011; Sheppard et al. 2013b). Some units may seem more 
natural than others from a theoretical point of view, but in practice all units have pros 
and cons, and the choice of unit is guided by the evolutionary resolution required by 
each pangenome investigation. 


3.2  Granularity of Homologous Groups 


When modelling pangenome diversity with homologous gene families, a further 
distinction can be made as to which homology relationship to use when clustering genes into 
families. A popular approach is to consider orthology relationships. In theory, genes 
are orthologous when they are related only by events of speciation (i.e. diversification 
of the whole genome), not by duplication or horizontal gene transfer. Because the true 
course of gene diversification events is unknown, we must rely on practical definitions 
of orthology. The theoretical definition implies that two orthologues cannot 
occur in the same genome. A usual criterion is thus to look for the bidirectional best hit 
(BBH) in a pairwise similarity search of all the genes in one genome against all of the genes in 
another genome (Altenhoff and Dessimoz 2009). 
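
As a minimal sketch of the BBH criterion, the code below assumes that similarity scores (for instance bit scores from a similarity search) are already available as dictionaries keyed by (query, subject) pairs. The function names and the toy scores are hypothetical and only illustrate the logic.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict mapping (query_gene, subject_gene) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy scores: a3 hits b2, but b2's best hit is a2, so a3 has no BBH partner.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 180, ("a3", "b2"): 90}
b_vs_a = {("b1", "a1"): 198, ("b2", "a2"): 181, ("b2", "a3"): 88}
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

By construction each gene appears in at most one pair per genome comparison, which mirrors the requirement that two orthologues cannot occur in the same genome.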

This pairwise relationship can be used to build a network of genes covering the 
whole pangenome dataset, where cliques (groups where the found relationship is 
transitive among members) are recognised as clusters of orthologous genes or COGs 
(Tatusov et al. 1997). Using this practical definition, it is straightforward to classify 
any gene into a cluster, many of which will however be clusters of genes on their 
own: orphan genes with no homologues, but also those resulting from a recent 
transfer or duplication. By construction, these COGs can only be absent or present 
in a single copy in a genome, which proves very convenient for representing the 
distribution of genes in the pangenome by a genome-to-COG binary matrix filled 
with zeros and ones. This representation can be handled by many simple methods 
that model the transition between these binary states over the tree of the genomes, 
i.e. events of gene gain and loss (Fig. 3a) (Mirkin et al. 2003). This approach has 
been widely used, but suffers from its stringent definition, which leaves many homologous 
genes out of the COGs under scrutiny and may strongly bias the inferred 
ancestral genome gene contents and the derived conclusions on ancestral functional 
repertoires. 

Instead, it is possible to consider a whole family of homologues, whose distribution 
across genomes can again be represented in a matrix of counts, where 
this time values range from zero to any integer number. Models of pangenome 
evolution can account for this multiplicity of gene copy number by invoking extra 
gene gain events (Fig. 3b) (Csűrös 2008; Csűrös and Miklós 2009). The nature of 
these gain events (duplication or horizontal gene transfer) is, however, not inferred, 
as this fundamentally requires knowing the phylogenetic relationships between genes 
within a homologous family. 
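
The two matrix representations discussed in this section can be sketched as follows, assuming each genome is given as a list with one family label per gene. The names `family_count_matrix` and `to_binary` are illustrative. Note that collapsing counts to zeros and ones is only a simplified stand-in for a true genome-to-COG matrix, in which multi-copy families would first be split into separate orthologous clusters.

```python
from collections import Counter


def family_count_matrix(genome_to_families):
    """Build a genome-by-family matrix of gene copy numbers.

    genome_to_families: dict mapping genome name -> list of family labels,
    one label per gene (repeated labels mean several copies).
    Returns (genomes, families, counts) with counts[i][j] the copy number
    of families[j] in genomes[i].
    """
    genomes = sorted(genome_to_families)
    families = sorted({f for labels in genome_to_families.values() for f in labels})
    counts = []
    for genome in genomes:
        tally = Counter(genome_to_families[genome])
        counts.append([tally[family] for family in families])
    return genomes, families, counts


def to_binary(counts):
    """Collapse copy numbers to presence (1) or absence (0)."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


genomes, families, counts = family_count_matrix(
    {"g1": ["famA", "famA", "famB"], "g2": ["famB"]}
)
print(counts)             # copy numbers per genome and family
print(to_binary(counts))  # presence/absence view
```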


3.3 Linkage of Genes and Syntenic Blocks 


Notwithstanding the type of evolving unit considered (aligned genome segment or 
gene family), all units are usually considered to evolve independently on the 
phylogeny. This is, however, not always realistic given the high linkage disequilibrium 
found in bacterial genomes, that is, the non-independence of physically linked characters 
in evolution, a consequence of their clonal mode of reproduction. 

Linkage can be introduced in a pangenome evolution model by specifying the 
location of genes on a genomic map. The evolution of genes on the map is then 
modelled through events of insertion, deletion and rearrangement. This map can 
relate the absolute position of genes in contemporary genomes (i.e. with nucleotide 
site coordinates) by chopping all genomes in the dataset into syntenic blocks, where 
homology and overall gene order between genomes are conserved (Vallenet et al. 
2006; Darling et al. 2010). However, as mentioned above, the larger the genome 
sample, the more syntenic blocks will split and shrink. Based on such genome maps, 
the history of each syntenic block can be estimated, describing the ancestral events 
of pangenome evolution. Even though in theory the map evolves over time due to 
genome rearrangements (Darling et al. 2008), in practice the maps are assumed to be 
constant so that the analysis can focus on fine-grained changes within the syntenic 
blocks. This assumption is commonly made, for example when investigating homologous 
recombination in the core genome (Didelot et al. 2010). 

Another option is to map the relative position of smaller evolving units (usually 
gene families) in each genome of the dataset. Such a relative map can be represented 
by a matrix of presence or absence of a direct adjacency between genes in a given 
genome, contemporary or ancestral. This more abstract representation allows the use 
of incomplete data, such as draft genome assemblies, where the physical linkage of 
sequences is not fully or not unambiguously documented. The evolution of gene 
neighbourhood is modelled by invoking events of creation and breakage of adja- 
cencies between neighbour genes, thereby modelling any insertion, deletion and 
rearrangement. Ancestral state reconstruction (see below) is then undertaken, by 
estimating a genome map at each ancestral node of a species phylogeny (Fig. 3d) 
(Bérard et al. 2012; Patterson et al. 2013; Duchemin et al. 2017). These models are, 
however, quite heavy computationally and may become overwhelmed by large 
structural diversity in the dataset. 
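
A relative map of this kind can be sketched as a set of unordered adjacencies. The code below is a simplified illustration: it assumes each genome is a list of contigs, each contig an ordered list of gene family names, and it ignores gene orientation and chromosome circularity. The function names are hypothetical and do not correspond to the cited software.

```python
def adjacencies(contigs):
    """Return the set of direct gene adjacencies in a genome.

    contigs: list of contigs, each an ordered list of gene family names.
    Each adjacency is an unordered pair stored as a frozenset, so draft
    assemblies simply contribute the adjacencies observed within contigs.
    """
    pairs = set()
    for contig in contigs:
        for left, right in zip(contig, contig[1:]):
            pairs.add(frozenset((left, right)))
    return pairs


def adjacency_changes(ancestor, descendant):
    """Adjacencies created and broken between two adjacency sets."""
    created = descendant - ancestor
    broken = ancestor - descendant
    return created, broken


ancestral = adjacencies([["a", "b", "c", "d"]])
derived = adjacencies([["a", "c", "b", "d"]])  # b and c have swapped places
created, broken = adjacency_changes(ancestral, derived)
print(sorted(sorted(p) for p in created))
print(sorted(sorted(p) for p in broken))
```

In this toy case a single rearrangement shows up as two broken and two created adjacencies, while the adjacency between b and c is retained.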


4 Methodological Approaches to Reconstructing 
Pangenome Microevolution 


4.1 Ancestral State Reconstruction 


The inference of ancestral genomes and corresponding gene gain and loss scenarios 
can be a complex and computationally intensive task, but it can also be simplified to 
the point that it becomes almost straightforward if the research questions are 
relatively simple. For example, using profiles of gene presence/absence in genomes 
and a phylogenetic tree as the only input, ancestral state reconstruction can be applied to 
infer in which internal nodes of the tree the genes were present, and therefore on 
which branches the genes were gained and lost. For a general review on ancestral 
state reconstruction, see Joy et al. (2016). One of the simplest and most popular 
approaches is to perform a parsimonious reconstruction, where the number of gain and 
loss events is minimised without the need to estimate any parameter (Mirkin et al. 
2003). In practice, this is more or less equivalent to performing maximum likelihood 
inference under a model in which gain and loss happen at the same small rates. 
However, probabilistic modelling of state evolution has the interesting property of 
integrating over several possible scenarios. Even a maximum likelihood point estimate 
of the presence of a gene at a given ancestral node will therefore consist of a 
non-binary probability, a nuanced result that allows the uncertainty in the 
ancestral reconstruction to be considered (Pagel 1999). A similar Bayesian approach is stochastic 
character mapping (Huelsenbeck et al. 2003), which consists in sampling gain and 
loss histories from their posterior probability distribution via a Monte Carlo method. 
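
As an illustration of parsimonious reconstruction, the sketch below applies the classical Fitch algorithm to a single gene presence/absence character, with gains and losses weighted equally. This equal weighting, the nested-tuple tree encoding and the function name `fitch` are assumptions made for the example; the cited study may weight gains and losses differently.

```python
def fitch(tree, states):
    """Bottom-up pass of Fitch parsimony for one binary character.

    tree: a rooted binary tree as nested 2-tuples with leaf names as strings.
    states: dict mapping leaf name -> 0 (gene absent) or 1 (gene present).
    Returns (candidate_states_at_root, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no change needed here.
        return shared, left_cost + right_cost
    # Children disagree: one gain or loss is required on a child branch.
    return left_set | right_set, left_cost + right_cost + 1


# The gene is present only in the sister genomes A and B.
tree = ((("A", "B"), "C"), "D")
presence = {"A": 1, "B": 1, "C": 0, "D": 0}
print(fitch(tree, presence))
```

For this toy tree the root is inferred to lack the gene and one event suffices, consistent with a single gain on the branch leading to A and B. A probabilistic method would instead return a probability of presence at each ancestral node.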

Ancestral state reconstruction is particularly well suited to analyses focused on 
specific genes rather than the whole pangenome, for example analysing the gain and 
loss of pathogenicity genes (Dingle et al. 2014) or resistance genes (Ward et al. 
2014). It can also be applied more generally to all genes in a pangenome, and the 
rates of gain and loss would typically be assumed to be equal, meaning that the 
genome size is at equilibrium (Touchon et al. 2009). Alternatively, the reconstruction 
can be based on genomic elements known to be gained and lost in one block, 
such as bacteriophages, plasmids, and integrative conjugative elements (Zhou 
et al. 2013). This represents one simple way of dealing with the linkage of genes 
mentioned previously, although at the cost of potentially losing information about 
the gene content evolution of the genomic elements assumed to be perfectly linked. 
At the other end of the spectrum, the reconstruction can be based on smaller elements 
than genes, for example k-mers, but in this case it becomes vital to relax the 
assumption of a fixed clock on gain and loss, for example using a local clock 
model (Didelot et al. 2009) as illustrated in Fig. 4. This technique has been applied 
to the pangenomes of Escherichia coli (Didelot et al. 2012) and Campylobacter 
jejuni (Sheppard et al. 2013a), showing in both cases a strong relationship between 
evolution of the accessory genome via gain and loss events and evolution of the core 
genome via homologous recombination. 

An important drawback of ancestral state reconstruction methods is that they 
ignore the nature (recombination or duplication) and origins (recombination donor) 
of gene gain events, which can yield partial and inaccurate scenarios when the true 
history is complex, especially with many recombination events. In particular, the 
exploitable signal from a profile of gene presence/absence in extant genomes is 
quickly saturated when several gene copies coexist in a genome and likely descend 
from separate past events. This issue can sometimes be tackled by defining strict 
families of orthologs, where every gene is present in one copy or none, but at the cost 
of losing the information on evolution of homologues. Ancestral state reconstruction 
could also in principle be applied to data on families of homologues, where each 
genome can contain zero, one or more copies of a gene. This would require fitting a 
ladder model similar to the ones used when analysing microsatellite data (Ohta and 
Kimura 1973; Wilson and Balding 1998). This approach is difficult in practice 
because bacterial accessory gene families of interest often have histories too complex 
to reliably infer orthologous groups and have high gain and/or loss rates that 
quickly saturate signals. It has, however, been applied successfully in studies where 
genomes of single representatives from fairly distant species were compared, thus 
ignoring the ‘messy’ variation introduced by within-population evolution (Csűrös 
and Miklós 2009). 


Fig. 4 Illustration of a pangenome gain and loss model with local clock. The clonal genealogy is 
shown in black. The width of the red block on the left of the branches is proportional to the rate of 
gain. Similarly, the blue block on the right of each branch represents the rate of loss. Both the gain 
and loss rates occasionally change over time. Individual gain events are represented by red arrows, 
and individual loss events are represented by blue arrows 


4.2 Phylogenetic Reconciliation 


To deliver more informative scenarios of evolution, it is necessary to know the origin 
of gene gains, which effectively means knowing the relationship between observed 
genes. Gene tree versus species tree reconciliation methods compare the topologies 
of phylogenetic trees built from individual gene sequences against a reference 
species tree (Maddison 1997). In the context of pangenome analysis, the species 
tree is a phylogenetic tree reconstructed from the whole of the core genome. Species 
and gene trees often have inconsistent topologies, which could happen by chance, 
especially since the gene tree typically has limited statistical support, or may be the 
result of evolutionary events affecting the history of the gene relative to the clonal 
history. Reconciliation methods intend to explain the significant topological discords 
by events of gene duplication, transfer, or loss (Szollosi et al. 2015). Figure 3c 
illustrates the principles behind reconciliation methods. Practically, both trees are 
annotated with the inferred events, such that there is a full agreement on the course of 
events, from the root of the gene lineage to the contemporary distribution of genes in 
genomes—thus reconstructing the path of evolution and diversification of genes in 
the clonal frame of genome evolution. As a result, this approach allows the explicit 
determination of the donors and recipients of transferred genes, or of the ancestor in which a 
gene was duplicated. Methods for pangenome reconciliation analysis have been 
proposed based on parsimonious reconstruction (Abby et al. 2010; Jacox et al. 2016) 
and probabilistic models (Szollosi et al. 2012, 2013). 
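
To give a flavour of reconciliation, the sketch below implements only the simplest variant: mapping each gene-tree node to the last common ancestor of its species in the species tree and counting the nodes that must be duplications. It deliberately ignores transfers, which the methods cited above do model, so a discordant topology is explained here by duplication (and implied losses) alone. The tree encoding and all function names are assumptions made for the example.

```python
def clades(tree):
    """List the leaf set (as a frozenset) of every node of a nested-tuple tree.

    The leaf set of the node itself is always the last element of the list.
    """
    if isinstance(tree, str):
        return [frozenset([tree])]
    collected = []
    leaves = frozenset()
    for child in tree:
        below = clades(child)
        collected.extend(below)
        leaves |= below[-1]
    collected.append(leaves)
    return collected


def lca(species_clades, species_set):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in species_clades if species_set <= c), key=len)


def count_duplications(gene_tree, gene_to_species, species_tree):
    """Count gene-tree nodes that map to the same species node as a child."""
    species_clades = clades(species_tree)
    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(species_clades, frozenset([gene_to_species[node]]))
        child_maps = [visit(child) for child in node]
        here = lca(species_clades, frozenset().union(*child_maps))
        if here in child_maps:
            duplications += 1
        return here

    visit(gene_tree)
    return duplications


species_tree = (("A", "B"), "C")
gene_to_species = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
print(count_duplications((("a1", "b1"), "c1"), gene_to_species, species_tree))
print(count_duplications(((("a1", "b1"), ("a2", "b2")), "c1"), gene_to_species, species_tree))
print(count_duplications((("a1", "c1"), "b1"), gene_to_species, species_tree))
```

The third gene tree groups the genes from A and C together, in conflict with the species tree. Under this duplication-only model the conflict costs one duplication plus losses, whereas a model that allows transfer could explain it with a single horizontal transfer event, which is why the choice of event model matters.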

The ancestral state reconstruction approach and the reconciliation approach have a 
lot in common, and the latter can be thought of as a natural extension of the former 
when observation is not limited to presence or absence or number of copies of a gene, 
but also includes the phylogenetic relationships between genes from separate 
genomes. Reconciliation methods are therefore superior in the sense that they exploit 
more of the available signal, but they are also much more challenging to implement 
computationally and have so far been limited to analysis of a handful of genomes. 
Ancestral state reconstruction methods are currently more popular but we predict that 
reconciliation methods will become increasingly widespread in the near future with 
the development of more effective statistical methods. Beyond the study of the atomic 
events whereby the pangenome evolves, both methods allow the inference of ancestral states 
in hypothetical ancestors, or in other words the reconstruction of ancestral genomes. Doing 
so, one can derive the expected phenotypic traits of the ancestors—antimicrobial 
resistance, metabolic activities, even ecological lifestyle. These inferred traits can 
then be compared to historical records of Earth evolution or pathogen epidemic 
spread to try and find causal relations between biological activity and the course of 
events (David and Alm 2011; Holden et al. 2013), or be considered in support of 
further ancestral reconstruction, such as scenarios of ecological niche colonisation 
(Lassalle et al. 2017). 
Abstract The evolution and structure of prokaryotic genomes are largely shaped by 
horizontal gene transfer. This process is so prevalent that DNA can be seen as a public 
good—a resource that is shared across individuals, populations, and species. The 
consequence is a network of DNA sharing across prokaryotic life, whose extent is 
becoming apparent with increased availability of genomic data. Within prokaryotic 
species, gene gain (via horizontal gene transfer) and gene loss result in pangenomes, 
the complete set of genes that make up a species. Pangenomes include core genes 
present in all genomes, and accessory genes whose presence varies across strains. In 
this chapter, we discuss how we can understand pangenomes from a network 
perspective under the view of DNA as a public good, how pangenomes are 
maintained in terms of drift and selection, and how they may differ between prokary- 
otic groups. We argue that niche adaptation has a major impact on pangenome 
structure. We also discuss interactions between accessory genes within genomes, 
and introduce the concepts of ‘keystone genes’, whose loss leads to concurrent loss of 
other genes, and ‘event horizon genes’, whose acquisition may lead to adaptation to 
novel niches and towards a separate, irreversible evolutionary path. 
1 Introduction 


Horizontal Gene Transfer (HGT) is the most important force affecting evolutionary 
change in prokaryotes, and its pervasiveness has resulted in a vast global network of 
connectivity between microorganisms. DNA is available for horizontal acquisition 
by prokaryotes in a variety of ways: conjugative plasmids (Grohmann et al. 2003; 
Lederberg and Tatum 1946) facilitate the transfer of DNA directly from cell to cell, 
phage can facilitate the indirect movement of DNA from one prokaryotic cell to 
another by generalised transduction (Zinder and Lederberg 1952), and gene transfer 
agents (GTAs) facilitate gene transfer by cell lysis. In some Archaea, we even see the 
formation of networks of connections between individuals that can lead to the 
formation of heterodiploid cells and recombination between the parental cells’ 
genomes (Naor and Gophna 2013). Another important mechanism is direct acquisi- 
tion of DNA through transformation. Extracellular DNA has a ubiquitous distribu- 
tion in natural environments from hydrothermal vents, to freshwater, soil, and 
sediment (Nagler et al. 2018), as well as in the biofilms (Steinberger and Holden 
2005) that line our sewage pipes (Vincke et al. 2001), contaminate hospital equip- 
ment (Stickler 2008), associate with tooth decay (Potera 1999), and much more. 
Therefore, DNA can be shared and used among organisms and effectively becomes a 
public good. All these mechanisms result in a DNA-sharing network that has 
probably existed since before life evolved to become cellular and will likely remain 
an important part of prokaryotic biology for as long as there are prokaryotes. 

With the advent, and subsequent accessibility, of next-generation sequencing 
technologies (Shendure et al. 2017), it became apparent that gene presence-absence 
variability within a species (i.e. strain-to-strain variability) was much larger than 
expected (Tettelin et al. 2005). For example, when the first three Escherichia coli 
genomes were sequenced, only 39.2% of their protein-coding genes were found to 
be common to all three genomes (Welch et al. 2002). In a more recent study 
involving 1524 Pseudomonas aeruginosa genomes, only 3% of genes were found 
to be shared (i.e. ‘core’) across all strains, with the remaining 97% being variably 
present in a subset of strains (Karasov et al. 2018). The existence of this variability in 
gene content within what we regard as single prokaryotic species led to the concept 
of a pangenome, the complete set of genes that are present in a given species 
(Tettelin et al. 2005). This set of genes is usually divided into two categories: core 
genes, that are present across all individuals in a species, and accessory genes, whose 
presence varies between individuals or strains (Tettelin et al. 2005; Welch et al. 
2002; Karasov et al. 2018; Laing et al. 2010). The pangenome concept revolutionises 
our thinking, since it means considering organisms like Escherichia coli not only in 
terms of the thousand or so genes that are common to all members, but also in terms 
of the 100,000 or so genes that are found in at least one, but not all, E. coli genomes 
(Land et al. 2015). This new information on the structure of the prokaryotic world 
has meant that we have to think about ‘units’ of selection (Okasha 2006) in different 
ways. In this chapter, we will outline some of the ways in which we can think about 
pangenomes and what this means for biology. Although our focus is on prokaryotes, 
it should be noted that some eukaryotes also have pangenomes. For example, a high 
degree of gene presence-absence polymorphism has been found in different genome 
sequences of humans (Sherman et al. 2019), cultivated rice (Wang et al. 2018; 
Hubner et al. 2019), sunflower (Hubner et al. 2019), and in the widespread 
coccolithophore Emiliania huxleyi (Read et al. 2013). 
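
The division of a pangenome into core and accessory genes described above can be sketched with plain set operations. The example assumes each strain is already summarised as a set of gene family names; the function name and the toy data are illustrative, and real analyses often relax the strict "present in all genomes" rule to tolerate incomplete assemblies.

```python
def partition_pangenome(genomes):
    """Split a pangenome into core and accessory gene families.

    genomes: dict mapping strain name -> set of gene family names.
    Returns (pangenome, core, accessory) as sets.
    """
    gene_sets = list(genomes.values())
    if not gene_sets:
        raise ValueError("at least one genome is required")
    pangenome = set().union(*gene_sets)   # found in at least one strain
    core = set.intersection(*gene_sets)   # found in every strain
    accessory = pangenome - core          # found in some but not all strains
    return pangenome, core, accessory


strains = {
    "s1": {"a", "b", "c"},
    "s2": {"a", "b", "d"},
    "s3": {"a", "e"},
}
pangenome, core, accessory = partition_pangenome(strains)
print(sorted(core), sorted(accessory))
print(len(core) / len(pangenome))  # fraction of the pangenome that is core
```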


2 Pangenome Properties 


As a consequence of the merging of genetic information through HGT and the 
existence of pangenomes, our thinking about the evolutionary history of prokaryotic 
genomes has changed. In fact, it is more relevant to think not of the evolutionary 
history of a genome, but rather the evolutionary histories of the various parts of a 
genome, since these histories can be different (Bapteste et al. 2009). The phylogenetic 
relationships inferred from a single gene, no matter how important that gene, 
rarely reflect the evolutionary history of the suite of organisms under consideration. 
This idea was codified by Darwin in ‘The Origin’ when he said: ‘The importance, for 
classification, of trifling characters, mainly depends on their being correlated with 
several other characters of more or less importance’ (Darwin 1860). In other words, 
the notion of homoplastic characters (i.e. characters whose similarity is due to 
convergent evolution) is an old idea and characters can differ in what they suggest 
is the proper classification of an organism. Though Darwin did not know about DNA 
or HGT, the warning about character congruence and classification still holds true 
today and perhaps even more so because of HGT and the non-tree likeness of this 
process. 

The pangenomes of different prokaryotic groups differ. Transformation, trans- 
duction, and conjugation contribute to shuffling variably sized portions of genomes 
through both homologous and non-homologous recombination. The frequency of 
the different mechanisms likely depends on environmental conditions, lifestyle, and 
cell biology (i.e. the molecular mechanisms present in particular cells or taxa) 
(Hanage 2016). Therefore, under different conditions, HGT and recombination can 
in principle range from non-existent to widespread, resulting in primarily clonal or 
panmictic groups, respectively (Yang et al. 2019). Furthermore, recombination 
barriers, both within and between species, can be fuzzy and potentially differ for 
different parts of the genome. This can make the delineation of populations or of 
species more complicated in prokaryotes, when compared to animals, for example 
(Hanage 2013). However, it has been suggested that natural species boundaries do 
exist in prokaryotes and that they can be defined (Bobay and Ochman 2017). On the 
whole, HGT and DNA recombination in prokaryotes can have similar consequences 
to sexual reproduction in eukaryotes: removing deleterious mutations, thereby 
avoiding Muller’s ratchet or mutational meltdown, while also offering a mechanism 
for bringing together advantageous mutations in different genes or parts of the 
genome. But crucially in prokaryotes, recombination can both remove and add a 
hugely variable number of genes to a genome, thereby affecting the overall gene 
repertoire rather than simply modifying existing genes by point mutation. That is, 
recombination in prokaryotes often results in insertions or deletions, while in 
eukaryotes it tends to swap alleles between chromosomes. 

Fig. 1 An illustration of how the number of accessory genes discovered grows as increasing 
numbers of genomes are sequenced. For species with open pangenomes, the number of accessory genes 
discovered continues to increase, while for closed pangenomes it plateaus quickly 

Pangenomes differ in the degree to which they are ‘open’ or ‘closed’. Species that 
share almost all genes with each other (i.e. have very little strain-to-strain gene 
content dissimilarity) having a large ‘core’ and small ‘accessory’ genome, are 
considered to have closed pangenomes (McInerney et al. 2017). In contrast, species 
can have open pangenomes in which gene content varies appreciably from one 
genome to another (McInerney et al. 2017) (see Fig. 1). Though we know the 
openness of prokaryotic pangenomes varies greatly from one species to the next 
(Tettelin et al. 2005), our estimates of openness can be affected by the available 
genomic data (i.e. the number of accessory genes is expected to increase as more 
strain information becomes available). As such, openness can be measured by 
modelling the number of accessory genes as a function of the number of sequenced 
genomes (Tettelin et al. 2005) (see also Chap. 1). The first analysis of openness found 
that eight Streptococcus agalactiae genomes were not enough to uncover all possible 
accessory genes and predicted that new genes would be found with every additional 
genome, leading to an essentially infinite pangenome. In contrast, the number of new 
accessory genes in Bacillus anthracis dropped to zero after the incorporation of only 
four genomes to the study of its pangenome (Tettelin et al. 2005). Therefore, accurate 
measurements of pangenome openness depend on sampling the broad diversity of 
genomes in a given species, and such measurements should ideally account for core 
genome diversity and the phylogenetic relationships between those genomes. 
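
The quantity underlying such openness estimates, the number of new genes contributed by each additional genome, can be computed empirically as in the sketch below. It averages over random genome orderings because the result depends on the order in which genomes are added. The function name, the number of orderings and the toy data are assumptions; fitting a decay curve to the output, as done in the original analysis, is left out.

```python
import random


def new_genes_per_genome(genomes, n_orders=100, seed=0):
    """Average number of previously unseen genes added by the k-th genome.

    genomes: dict mapping genome name -> set of gene family names.
    Returns a list whose k-th entry is the mean count of new genes brought
    by the k-th genome over n_orders random orderings.
    """
    rng = random.Random(seed)
    names = sorted(genomes)
    totals = [0.0] * len(names)
    for _ in range(n_orders):
        order = names[:]
        rng.shuffle(order)
        seen = set()
        for k, name in enumerate(order):
            totals[k] += len(genomes[name] - seen)
            seen |= genomes[name]
    return [total / n_orders for total in totals]


# Closed toy pangenome: every genome carries the same genes.
closed = {"g1": {"a", "b"}, "g2": {"a", "b"}, "g3": {"a", "b"}}
# Open toy pangenome: every genome adds one gene seen nowhere else.
open_ = {"g1": {"a", "u1"}, "g2": {"a", "u2"}, "g3": {"a", "u3"}}
print(new_genes_per_genome(closed))
print(new_genes_per_genome(open_))
```

In the closed example the count of new genes drops to zero after the first genome, while in the open example it stays at one gene per added genome, the two extremes contrasted in the text.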


3 Public Goods 


The idea that DNA functions as a public good (Erwin 2015; McInerney et al. 2011a; 
McInerney and Erwin 2017) stems from the fact that HGT makes DNA available to 
other ‘users’ and this process has structured a great deal of the life on this planet, 
both cellular and viral (Bapteste et al. 2012, 2013). Integration of a new DNA 
sequence into a genome can only be successful if the source organism and the 
recipient organism can both make use of this DNA in some way. Carl Woese referred 
to the universal genetic code as being the ‘lingua franca’ of genetic commerce 
(Woese 2002). HGT has been observed in almost all known phyla, though HGT 
seems to be reduced in frequency among eukaryotes and perhaps reduced further in 
multicellular organisms (Schonknecht et al. 2014; McInerney et al. 2014; Ku et al. 
2015). As a consequence of HGT, there is no universal Tree of Life, and instead 
there is a network of life reflecting the vertical and horizontal movements of genetic 
information (Bapteste et al. 2012, 2013; Corel et al. 2018). 

Our current appreciation of evolutionary history in prokaryotes and the observations 
of pangenomes have led us to consider what metaphors might be appropriate for 
representing, modelling, and understanding life on the planet. A variety of alterna- 
tives to the tree metaphor, such as ‘cobwebs of life’ (Ge et al. 2005) or ‘rhizome of 
life’ (Merhej et al. 2011), have been used. However, some of us have proposed to 
depart from a way of thinking that inherently depends on a particular kind of 
diagram. Instead we have advocated a focus on the fundamental process of HGT, 
and the fact that it constructs new genomes in the same way that, say, a furniture 
manufacturing plant might bring together different materials in order to construct a 
new kind of chair, or in the way that a football team might substitute one player for 
another. As mentioned above, Woese suggested that HGT could be thought of in 
commercial terms (Woese 2002), and a logical extension to this line of thinking is 
that DNA acts as though it is a ‘public good’ (McInerney et al. 2011a, b; McInerney 
and Erwin 2017). Briefly, in the theory of goods, Nobel laureate Paul Samuelson 
initially described two kinds of goods thus: ‘[. . .] I explicitly assume two categories 
of goods: ordinary private consumption goods which can be parcelled out among 
different individuals [. . .] and collective consumption goods [. . .] which all enjoy in 
common in the sense that each individual’s consumption of such a good leads to no 
subtraction from any other individual’s consumption of that good [. . .]’ (Samuelson 
1954). Since then, the concept has been expanded so that four kinds of goods are 
recognised—private goods, public goods, club goods, and common goods 
(McInerney et al. 2011a), based on whether goods are rivalrous and/or excludable. 
The criteria for each of the classifications are contained in Fig. 2, along with 
examples of goods that fall easily into each of these categories. A ‘good’ is said to 
be rivalrous if its consumption by one consumer prevents simultaneous consumption 
by other consumers, and a ‘good’ is said to be excludable if it is possible to prevent 
Fig. 2 The nature of Goods. Goods can fall into four different categories (private, club, common, 
and public) according to whether they are rivalrous or non-rivalrous, and excludable or 
non-excludable. The figure also gives some examples of goods that easily fall into each of these 
four categories 


others from having access to it. DNA possesses the property of being non-excludable 
(e.g. the DNA of any individual is made available, at least at the time of death of the 
cell or the individual) and it is also non-rivalrous in a practical sense, given that the 
amount of DNA that is produced by any given species cannot realistically be used up 
by any consumer. This perspective is useful in the sense that viewing genome 
evolution as a process of building functioning tools (i.e. new kinds of organisms) 
allows us to ask questions that would not make much sense if we used ‘tree-thinking’ 
(Bapteste et al. 2013; Dagan and Martin 2009). Tree-thinking inherently supposes 
that genes came to be in a genome because all the genes have been inherited through 
the same lineage of descent, a view that implies that genes are ‘private’ to a clade. 
‘Goods-thinking’, on the other hand, frees us to think more about why the particular 
set of genes that we observe in a genome are there, rather than some other set of 
genes. We do not assume that any gene is a private good, exclusively found in a 
particular species or clade, with other organisms excluded from accessing the 
segment of DNA. Goods-thinking holds that a genome has evolved to be the way 
it is through vertical inheritance from a common ancestor, but also through the 
horizontal acquisition of genes, with the rate of gain (and loss) of genes being 
modified by the influences of random drift, selection, and demography. Goods- 
thinking, therefore, needs some new tools, outside of the framework of the bifurcat- 
ing phylogenetic tree, in order to properly analyse gene and genome evolution 
(Bapteste et al. 2009). Here we deal specifically with the pangenome component of 
goods-thinking theory. 
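
The four-way classification of goods by rivalry and excludability can be written as a small decision function. This is only a restatement of the standard economic scheme summarised above; the function name is illustrative.

```python
def classify_good(rivalrous, excludable):
    """Classify a good from its two defining properties.

    rivalrous: consumption by one user prevents simultaneous use by others.
    excludable: it is possible to prevent others from having access to it.
    """
    if rivalrous and excludable:
        return "private"
    if rivalrous:
        return "common"
    if excludable:
        return "club"
    return "public"


# DNA is argued above to be non-rivalrous and non-excludable.
print(classify_good(rivalrous=False, excludable=False))
```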


4 Analyses of Pangenomes 


Because of the fluidity of genomes, caused by accessory gene gain and loss, the 
analysis of pangenomes lends itself more suitably to networks than to phylogenies. 
Networks are mathematical graphs represented by nodes, or vertices, which are 
connected by edges, or lines, if and only if a relationship exists between them. 
Networks are widely used in ecology—and in biology in general—to represent, 
for example food webs (Dunne et al. 2002), social interactions (Robins et al. 2007), 
nutrient/energy flows (Allesina et al. 2005), and cooperation between members in a 
population (Jain and Krishna 2001). Networks can have edges that are either directed 
(often shown as an arrow) or undirected, depending on whether the relationship that 
connects the nodes has directionality (e.g. to connect an organism to their food 
source in a food web). The study of networks, or graphs (i.e. graph theory) dates to at 
least 1735 (Skiena 2008; Compeau et al. 2011) and has advanced rapidly due to its 
applications in computer science, engineering, physics, and biology. The public 
goods nature of DNA makes a network structure ideal to uncover patterns and 
processes of evolution in ways where phylogenetic trees would be somewhat 
lacking, since phylogenetic trees cannot represent lateral movement of genetic material.
The analysis of features contained within the graphs, such as non-transitive triplets
or nodes with identical incident edges, can reveal patterns of recombination or gene
sharing (Bapteste et al. 2012; Corel et al. 2018; Meheust et al. 2018). 

In the analysis of pangenomes, networks are often k-partite or multi-partite, 
meaning that their nodes can be coloured using k colours such that no node is 
directly connected to another with the same colour (Pavlopoulos et al. 2018). A
special case of k-partite graphs is the bipartite, or two-colourable, graph. In pangenome
analyses, bipartite graphs usually connect genomes to their constituent genes (Corel 
et al. 2018). Bipartite networks have been used previously to identify the levels of 
gene sharing within microbial genomes (Corel et al. 2018), to characterise the 
capacity of accessory genes in metabolic networks (Goyal 2018), and to interrogate 
gene presence/absence patterns and coincident relationships (McNally et al. 2016). 
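
As a rough illustration of the bipartite representation described above, the following Python sketch builds a genome-to-gene graph from a presence/absence table and reads off two simple features: which genes are shared by every genome, and which genes have identical incident edges (that is, occur in exactly the same set of genomes). The genome names, gene names, and helper functions are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
from collections import defaultdict

# Hypothetical toy data: each genome maps to the set of gene families it carries.
GENOMES = {
    "genome_A": {"core1", "core2", "acc1", "acc2"},
    "genome_B": {"core1", "core2", "acc1", "acc2"},
    "genome_C": {"core1", "core2", "acc3"},
}


def bipartite_edges(genomes):
    """Return the sorted edge list (genome, gene) of the bipartite graph."""
    return sorted((g, gene) for g, genes in genomes.items() for gene in genes)


def gene_neighbours(genomes):
    """Map each gene node to the set of genome nodes it is connected to."""
    neighbours = defaultdict(set)
    for genome, gene in bipartite_edges(genomes):
        neighbours[gene].add(genome)
    return dict(neighbours)


def core_and_accessory(genomes):
    """Split genes into those present in every genome and all the others."""
    n_genomes = len(genomes)
    neighbours = gene_neighbours(genomes)
    core = {g for g, hosts in neighbours.items() if len(hosts) == n_genomes}
    return core, set(neighbours) - core


def twin_groups(genomes):
    """Group genes whose incident edges are identical (same set of genomes)."""
    groups = defaultdict(set)
    for gene, hosts in gene_neighbours(genomes).items():
        groups[frozenset(hosts)].add(gene)
    return [genes for genes in groups.values() if len(genes) > 1]


if __name__ == "__main__":
    core, accessory = core_and_accessory(GENOMES)
    print("core:", sorted(core))
    print("accessory:", sorted(accessory))
    print("twins:", sorted(sorted(t) for t in twin_groups(GENOMES)))
```

Plain dictionaries and sets are used here so that the sketch has no dependencies; a dedicated graph library would be a reasonable alternative for real data sets.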

Especially relevant for genome evolution is the N-rooted fusion graph (Haggerty 
et al. 2014). This graph differs from a phylogenetic tree due to the presence of more 
than one root node (a node that depicts the point-of-origin of all operational taxo- 
nomic units in the graph) and the presence of at least one internal node in the graph 
where the in-degree of the node (the number of edges pointing towards that node) is 
greater than 1 and the out-degree of the node (the number of edges emerging from that 
node) is 1 (Fig. 3). In other words, the merging of genetic material inherently means 
that the graph needs more than a single origin or root. It also means that the point at 
which the material merged must be represented by a merger, or fusion node (Fig. 3). 
The various components of the internal structure of an N-rooted fusion graph can be 
determined by the usual phylogenetic approaches [i.e. parsimony, likelihood, or 
distance matrix methods (Felsenstein 2003)]. The complete N-rooted graph is then 
constructed by merging of these individual phylogenetic trees, by constructing fusion 
nodes at the appropriate places (Haggerty et al. 2014).

Fig. 3 An N-Rooted Fusion Graph. This kind of branching diagram can be used to illustrate the
merging of evolving objects. The nodes labelled R indicate the root nodes for this graph. Each root
node depicts the root for a different kind of gene. The node labelled F indicates the fusion node. The
different node colours indicate different gene families, with the blue nodes indicating that they are a
fusion family
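
The structural definition given in the text (more than one root, plus at least one internal node with in-degree greater than 1 and out-degree equal to 1) can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: the graph is stored as a list of directed edges pointing from ancestor to descendant, and the node labels and function names are hypothetical. It is not the construction procedure of Haggerty et al. (2014), only a test of the degree conditions.

```python
from collections import defaultdict

# Hypothetical toy graph: edges point from ancestor to descendant.
# R1 and R2 are the roots of two gene families; F is where they merge.
EDGES = [
    ("R1", "a1"), ("R1", "a2"),
    ("R2", "b1"), ("R2", "b2"),
    ("a2", "F"), ("b1", "F"),
    ("F", "f1"),
    ("f1", "f2"), ("f1", "f3"),
]


def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for parent, child in edges:
        out_deg[parent] += 1
        in_deg[child] += 1
    return dict(in_deg), dict(out_deg)


def roots(edges):
    """Nodes with outgoing edges but no incoming edge."""
    in_deg, out_deg = degrees(edges)
    return sorted(n for n in out_deg if in_deg.get(n, 0) == 0)


def fusion_nodes(edges):
    """Nodes with in-degree > 1 and out-degree == 1, as defined in the text."""
    in_deg, out_deg = degrees(edges)
    return sorted(n for n in in_deg if in_deg[n] > 1 and out_deg.get(n, 0) == 1)


def is_n_rooted_fusion_graph(edges):
    """True if the graph has more than one root and at least one fusion node."""
    return len(roots(edges)) > 1 and len(fusion_nodes(edges)) >= 1


if __name__ == "__main__":
    print("roots:", roots(EDGES))
    print("fusion nodes:", fusion_nodes(EDGES))
    print("N-rooted fusion graph:", is_n_rooted_fusion_graph(EDGES))
```

An ordinary rooted tree fails both conditions, because it has a single root and every node has an in-degree of at most 1.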


5 How Are Pangenomes Maintained? 


Because acquired DNA can function across multiple organisms—facilitating it to 
become a public good—HGT into some individuals in a population creates diversity 
within that species. Transferred sequences will be present in a subset of the 
population’s genomes and absent in the rest (McNally et al. 2016), becoming raw 
material for natural selection (see Fig. 4). Multiple iterations of this process have 
most likely resulted in the observed pattern of hugely varying gene content across 
conspecific genomes (Welch et al. 2002; Lukjancenko et al. 2010; Koonin and Wolf 
2008). Maintenance of the observed high levels of variation requires an explanation, 
because, while we know that transformation, conjugation, and transduction introduce
this presence-absence variation, it is expected that both natural selection and
genetic drift would remove this kind of genetic variation from populations. In terms 
of sequence variation within populations, different mechanisms have been proposed 
to explain the maintenance of diversity. These mechanisms range from relatively 
trivial explanations, such as the existence of a balance between the rates at which 
new variants arise in populations (by mutation, for example) and the rates at which 
they are removed, to more exotic mechanisms such as heterozygote advantage, 
interactions between genotypes and different environments, and negative 
frequency-dependent selection (Hahn 2018). Although most of these explanations 
have been developed in order to account for high levels of genetic diversity in 
diploid, sexually reproducing eukaryotes, some of these mechanisms can also help
us to understand genetic variation in prokaryotes. However, understanding the
existence and maintenance of pangenomes has its own particular challenges.

Fig. 4 Prokaryotic DNA becomes a public good upon cell death or when the DNA is taken from
the cell via phage or plasmids. Pangenomes can then accrue via the differential acquisition of these
public goods. [Diagram: Donor → Public → Pangenome]

A key element to be considered when we speak about mechanisms that maintain 
variability in gene content in prokaryotic populations is the fitness effect that these 
accessory genes have on individuals. We will likely find examples of particular 
genes whose presence is neutral, deleterious, or adaptive in most genomes; we are 
already familiar with genes in the latter class such as those conferring antibiotic 
resistance and pathogenicity islands (Sheppard et al. 2018). However, an interesting 
question to think about is whether accessory genes on average contribute to fitness 
(or under which circumstances they may be adaptive), and which mechanisms have 
led to their patchy occurrence in genomes. Depending on the average fitness effect of 
accessory genes, different mechanisms could be governing their presence. 

If accessory genes are mostly deleterious, which could be the case if they are 
predominantly selfish or parasitic, then the patchy presence patterns that we observe 
could reflect a constant arms race between these selfish elements and the host 
genome (somewhat equivalent to the Red Queen hypothesis for maintenance of 
variability in populations of interacting hosts and pathogens). Although this pattern 
may be responsible for a proportion of accessory genes, it is very unlikely that this 
explains most of the observed variability and the existence of pangenomes, partly 
because many accessory genes are not related to selfish elements and appear to be 
involved in multiple cellular functions (McNally et al. 2016; Sheppard et al. 2018). 

If accessory genes are usually neutral in terms of fitness, eventually they would be 
randomly fixed or lost in different populations due to genetic drift, particularly if 
recombination is rare. A neutralist model for pangenomes implies that we see 
presence-absence variation because there is a random ‘rain’ of genes constantly
being acquired and we observe their presence in a genome because they have either
not had enough time to drift to fixation or to be lost again. This kind of model implies 
that neither the gain nor the loss of accessory genes has a fitness effect (Baumdicker 
et al. 2012), a situation that seems contradicted by the observation of both prophage 
(Nanda et al. 2015) and antibiotic resistance (Her and Wu 2018) genes affecting 
fitness. A recent study (Andreani et al. 2017) showed a correlation between 
pangenome fluidity and synonymous variation, which was taken to imply that 
genome content diversity is mostly neutral. The implication was that synonymous 
diversity arises in the absence of selection and if this correlates with genome fluidity, 
then genome fluidity is also neutral. The problem with this model is that synonymous
diversity in prokaryotes is not necessarily neutral, and we see stronger
selection on synonymous codon usage in organisms with large effective population
size (Ne) (Sharp et al. 1993), so the correlation between large Ne and genome fluidity
is unlikely to be a consequence of drift alone.

Recently, a drift-barrier model for pangenome evolution has been proposed 
(Bobay and Ochman 2018). The authors observed a positive correlation between 
pangenome size and Ne (using two independent measures of Ne for different bacterial
species). In contrast to Andreani et al. (2017), they propose that, on average,
accessory genes make a positive contribution to fitness. Based on nearly neutral
evolutionary theory, they then explain the correlation between Ne and pangenome
size by the loss of slightly advantageous genes in populations with small Ne.
Therefore, populations with large Ne would maintain a larger number of accessory
genes. However, while this may help explain larger genome size (i.e. the maintenance
of more genes), it does not necessarily explain diversity in gene content in
different individuals from the same population, since those slightly advantageous 
genes would be expected to eventually fix in the population. Furthermore, the 
authors did not deal with the likelihood that, on occasion, these advantageous 
genes would result in sweeps to fixation. The problem with this model is outlined 
in simulations by Niehus et al. (2015). As some of us have previously proposed 
(McInerney et al. 2017), some of the basics of this drift-barrier model, if combined 
with niche adaptation, can go further in explaining the maintenance of genome 
content diversity. Under the adaptive pangenomes model of McInerney et al. 
(2017), accessory genes make, on average, a positive contribution to fitness, and 
this contribution may be niche dependent. Therefore, genes are maintained in the 
niches where they are beneficial and lost in others. However, ongoing migration 
would still allow recombination in other parts of the genome, and thus maintenance 
of large Ne, at least for the core genome.
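
The drift-barrier reasoning in this paragraph can be made concrete with a standard population-genetic approximation. The sketch below assumes a haploid Wright-Fisher population whose census size equals its effective size Ne, and uses the diffusion approximation for the fixation probability of a single new gene copy with selective advantage s, namely (1 - exp(-2s)) / (1 - exp(-2·Ne·s)). These modelling choices, the function names, and the parameter values are assumptions made for illustration; they are not the analysis performed by Bobay and Ochman (2018).

```python
import math


def fixation_probability(s, ne):
    """Diffusion approximation for the fixation probability of one new copy.

    Assumes a haploid Wright-Fisher population with effective size ne and a
    selection coefficient s (intended for s >= 0 in this sketch; strongly
    deleterious values combined with very large ne may overflow).
    """
    if s == 0:
        return 1.0 / ne  # neutral limit: fixation by drift alone
    # expm1(x) = exp(x) - 1, which stays accurate for very small x.
    return math.expm1(-2.0 * s) / math.expm1(-2.0 * ne * s)


def relative_to_neutral(s, ne):
    """How many times more likely fixation is than for a neutral gene."""
    return fixation_probability(s, ne) * ne


if __name__ == "__main__":
    s = 1e-5  # hypothetical, slightly advantageous accessory gene
    for ne in (1e3, 1e5, 1e7):
        ratio = relative_to_neutral(s, ne)
        print(f"Ne = {ne:.0e}: fixation is {ratio:.2f} times the neutral expectation")
```

Under these assumptions, a gene with a small advantage behaves almost neutrally when the product of Ne and s is well below 1 and is efficiently favoured only when that product is well above 1, which is the sense in which small populations are expected to lose slightly advantageous genes.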

In line with the McInerney et al. (2017) model of pangenome maintenance by a 
combination of drift and niche-dependence, there is evidence that at least a signif- 
icant fraction of accessory genes are beneficial and involved in niche adaptation 
(Bruns et al. 2018; Rubino et al. 2017; McInerney 2013). The adaptability of 
prokaryotes means that they occupy niches all over the planet—including oceans 
(Sunagawa et al. 2015), ice sheets (Anesio et al. 2017), and salt flats (Caton et al. 
2004), as well as ecosystems deep within the earth's crust (Chivian et al. 2008), and 
on and within our own bodies (The Human Microbiome Project Consortium 2012).
Some ‘specialist’ prokaryotic species focus on one specific niche; for example,
Buchnera aphidicola is an endosymbiont that forms an obligate association with 
aphids (van Ham et al. 2003). Such specialists would likely have little to gain from 
extensive gene content diversity, possibly explaining the relatively closed nature of some
species’ pangenomes. For example, Tropheryma whipplei, an intracellular human
pathogen and the causative agent of Whipple’s disease (Gorvel et al. 2010), has an 
extremely restricted pangenome (Fenollar et al. 2014) and a smaller-than-average Ne
(Bobay and Ochman 2018). In contrast, ‘generalist’ prokaryotic species can occupy 
many of the niches made available to them. Escherichia coli has been identified in 
several different kinds of environments including the gut and urinary tract of 
humans, and indeed other warm- and cold-blooded animals (Tenaillon et al. 2010), 
as well as soil, sediment, and water (Savageau 1983). In order to occupy such 
variable environments, these species must be able to adapt to different carbon and 
nitrogen sources (Bertin et al. 2011), to evade various antibiotic pressures (Sáenz 
et al. 2004), and to utilise different types of respiration depending on oxygen 
availability (Jones et al. 2007). Recent work on the metabolic potential of accessory 
genes has identified a correlation between the number of novel metabolites that a 
given strain can synthesise and the openness of their pangenome, suggesting that the 
acquisition of such genes is adaptive (Goyal 2018). Other scenarios where variation 
in accessory genes is actively maintained by selection include negative frequency- 
dependent selection (Corander et al. 2017) where a major allele (gene presence or 
absence in our case) is at a disadvantage compared with the minor allele (the other 
character state). For example, in the case of vaccine programmes, it is likely that a 
vaccine targeting a non-essential accessory gene will confer a selective advantage on 
strains that do not have that accessory gene (Azarian et al. 2018). Bacteriophages 
may have a similar effect on non-essential attachment proteins and other cellular 
components. Finally, it is also the case that a particular gene may be beneficial in a 
specific niche when another gene is present, but not so when that partner is absent. 
This co-dependency of genes for fitness/adaptation to a particular niche will manifest 
particular patterns of co-occurrence in genomes (Cohen et al. 2013). 
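
To show how negative frequency-dependent selection can hold gene presence and absence at intermediate frequencies, the following toy model iterates the frequency of genomes carrying an accessory gene. The linear fitness functions, the selection strength, and the function names are assumptions made purely for illustration; this is a deterministic sketch, not the model of Corander et al. (2017).

```python
def next_frequency(p, s):
    """One generation of a toy haploid model with negative frequency dependence.

    p is the frequency of genomes carrying the accessory gene. Each type is
    assumed to become less fit as it becomes more common.
    """
    w_present = 1.0 + s * (1.0 - p)  # carriers do best when they are rare
    w_absent = 1.0 + s * p           # non-carriers do best when carriers are common
    mean_fitness = p * w_present + (1.0 - p) * w_absent
    return p * w_present / mean_fitness


def simulate(p0, s, generations):
    """Return the trajectory of carrier frequency, starting from p0."""
    trajectory = [p0]
    for _ in range(generations):
        trajectory.append(next_frequency(trajectory[-1], s))
    return trajectory


if __name__ == "__main__":
    for start in (0.01, 0.99):
        final = simulate(start, s=0.5, generations=200)[-1]
        print(f"start = {start:.2f}: frequency after 200 generations = {final:.4f}")
```

In this sketch the carrier frequency approaches the same intermediate value whether the gene starts rare or common, so neither presence nor absence goes to fixation. With s set to 0 the frequency does not change, which corresponds to the absence of selection in an infinitely large population.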

Notwithstanding the argument being made here that pangenomes are, on average, 
constructed and maintained by niche adaptation, we are still a long way from having 
enough data to say that this understanding is true in all cases. To assess whether 
neutralist or selectionist scenarios warrant greater or lesser support in different 
prokaryotic species and populations, we need more genomic data and information 
on population structure, levels of migration and recombination, and the distribution 
of fitness effects of accessory genes in different niches or environments. This 
requires deep sampling of prokaryotic genomes across space (within and between 
niches) and ideally along time. Recording of information on as many environmental 
variables as possible would also be highly advantageous for understanding which 
factors influence the evolution of pangenomes.


6 Keystone Genes and Event Horizon Genes


The dynamics of accessory gene repertoires is clearly a subject of great interest in 
microbiology. We have a poor understanding of how these repertoires are structured 
and what influences their content, how they grow and are maintained. The process of 
gene loss is also poorly understood. We have outstanding questions about what we 
might term ‘keystone genes’, those genes that play a central role in determining what 
other genes might be successful in a genome. This keystone gene concept is 
analogous to the keystone species concept in macroecology (Paine 1969); keystone 
species are those whose presence or absence can result in a major shift in the make- 
up of a particular ecosystem, often resulting in ecosystem collapse, if the keystone 
species leaves or goes extinct (Estes et al. 1978). 

In a related, but slightly different context, we might consider the case of ‘event 
horizon’ genes. To give an example of the possible existence of such genes, we can 
consider the evolution by gene acquisition of Archaeal halophiles from an ancestor 
that was a methanogen (Nelson-Sathi et al. 2012). This transition must have involved 
the rapid acquisition of a large number of genes. Whereas Haloarchaea are hetero- 
trophic, facultatively anaerobic or aerobic organisms with a phototrophic capability, 
their ancestors the methanogens are obligately anaerobic, methane-producing, 
chemolithotrophic archaea. The differences between these two closely related 
groups illustrate that seismic changes in genome content can occur, but also that 
the absence of intermediate forms suggests that such changes can come about with 
great rapidity. This leads us to the question of which genes, when acquired, led to the 
establishment of the halophile phenotype. In an analogy with astrophysics, we can 
speculate whether there has been an ‘event horizon’ or a point of no return, where the 
acquisition of a particular gene or set of genes permanently converted a methanogen 
to a halophile. We might imagine that the combination of importers of organic 
compounds and genes for heterotrophic metabolism marked the point of no return. 
Indeed, there seems to have been in this case no return, since all halophilic archaea 
are monophyletic and none have abandoned this lifestyle. Therefore, the order of 
gene acquisition and gene loss is an important question. Future work will help 
understand whether these keystone and event horizon genes are common in acces- 
sory gene repertoires. 


7 Some Conclusions and Future Directions 


While evolution has no particular direction, the likely success of a particular 
genomic sequence relates to the notion of ‘unity of purpose’. In this sense, the 
various components of a biochemical pathway can be said to have unity of pur- 
pose—collectively they enable the biological transformation of some important 
molecules. The components of the translation apparatus similarly have a unity of 
purpose. As a corollary, we could say that inserting genes that can enable
methanogenesis into the same genome as genes that are responsible for importing
sugars would not likely lead to a genome with a particularly united purpose—one 
part of the genome would be dedicated to producing energy by chemolithotrophy, 
while another part of the genome would be dedicated to a heterotrophic lifestyle. Yet 
situations like this must surely arise from time to time, given the pervasiveness of 
HGT. Two great unknowns right now include how often such conflicts arise in 
nature, and how compatible are the genes we see in genomes. We know that they are 
compatible enough to give rise to functioning organisms, but we do not know how 
each individual gene contributes to fitness. Background selection and hitch-hiking 
Hill-Robertson effects (Hill and Robertson 1966) are mechanisms that can limit the 
‘impact’ of natural selection and allow maintenance of slightly deleterious variants 
(Price and Arkin 2015), including, we would suppose, accessory genes that have a 
slightly deleterious fitness effect. 

The focus on pangenomes is usually centred on protein-coding genes, but there 
are several other levels at which pangenomes provide food for thought. An analysis 
of E. coli genomes has revealed that selection on non-coding regions has been 
instrumental in shaping the success of a particular sequence type (ST131) of the 
species (McNally et al. 2016). This brings into focus the combinatorial nature of 
genome structure—that the presence or absence of particular kinds of protein-coding 
genes, or even RNA-coding genes is only part of the story, and that the ‘regulatory 
pangenome’ will be one of the most important future challenges.


Abstract The rapidly expanding number of sequenced bacterial strains and species,
and the ongoing curation of bacterial pangenomes, have uncovered unexpected complexities
in understanding and addressing antibiotic resistance in the context of the
pangenome. It is becoming apparent that differences in the genetic background can 
cause species and strain-specific responses to the same antibiotic, triggering differ- 
ential selective pressures and thereby strain or species-specific adaptive outcomes. In 
this chapter, we consider how the pangenome, on a between and within species level, 
can affect the response to antibiotics and the development of resistance as well as the 
role selective pressures such as antibiotics play in shaping and maintaining the 
pangenome. We review the tools that are used to study antibiotic resistance within 
a pangenomic context, highlight recent findings, discuss strategies for predicting the 
emergence of resistance and consider how effective therapies can be developed in 
the context of the pangenome. 


Keywords Pangenome · Antibiotic resistance · Genomics · High-throughput tools ·
Adaptive evolution · Network analyses · Epistasis · Predictions · Machine learning


1 Introduction 


Antibiotic resistance is a naturally occurring phenomenon that can be found in 
environments containing antibiotic-producing microorganisms, even in the absence 
of human activity (D'Costa et al. 2006). While antibiotic resistance is rampant in 
livestock and is a confounding factor in the emergence and spread of resistance into 
the human population, most research focuses on bacterial pathogens affecting 
humans and in particular the ESKAPE pathogens (Enterococcus faecium, Staphylococcus
aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas
aeruginosa, and Enterobacter species) as well as Mycobacterium tuberculosis
(Santajit and Indrawattana 2016). Instrumental in the development of resistance is 
a bacterium’s inherent ability to survive exposure to low antibiotic concentrations 
giving the population the opportunity to accumulate genomic changes, eventually 
leading to full resistance (Drlica et al. 2008). In clinical practice, bacteria can 
frequently encounter significantly lower drug concentrations in host-niches such as 
the nasopharynx, inner ear, or lungs compared to plasma levels (Rybak 2006). 
Exposure to subinhibitory concentrations of antibiotics may be capable of reducing 
bacterial growth rates, but can fail to fully eradicate infections, providing selective 
pressure for acquired resistance. Outside clinical settings, environments containing 
antibiotics are plentiful, especially due to the rise in antibiotic usage in humans, 
agriculture, and veterinary medicine (D’Costa et al. 2006; Watkinson et al. 2007). In 
such environments, selection for antibiotic resistance is likely and frequent (Gullberg 
et al. 2011). 

There are several mechanisms whereby bacteria can resist antibiotic stress 
including modification of the antibiotic’s direct target, enzymatic drug inactivation, 
and reduction of intracellular drug concentrations via efflux pumps (Walsh 2000; 
McKeegan et al. 2002; Wright 2003) (Fig. 1). Adaptation—the process by which 
bacteria attain such mechanisms of resistance—can happen through two modes: 
horizontal and vertical evolution. The horizontal mode of adaptation (horizontal 
gene transfer; HGT) involves the acquisition of genetic material from organisms that 
share the same environment, whereas the vertical mode of adaptation involves the 
acquisition of de novo mutations. Both modes have an important role in shaping the 
pangenome of bacterial species (Santajit and Indrawattana 2016; Sommer et al. 
2017). The use of antibiotics can exert selective pressures that fix horizontally 
transferred genes or acquired mutations in a population. Examples of HGT include 
integrons carrying mecA which converts methicillin-sensitive S. aureus (MSSA) to 
the resistant “superbug” MRSA (Wielders et al. 2002), beta-lactamases in 
P. aeruginosa, A. baumannii, and various species of Enterobacteriaceae (Weldhagen 
2004), and macrolide resistance in Staphylococcus epidermidis (Lampson et al. 
1986) and Streptococcus pneumoniae (Chancey et al. 2015). Examples of de novo 
resistance mutations are plentiful, including mutations in topoisomerase subunits 
gyrA and parC conferring resistance to fluoroquinolones (Fabrega et al. 2009) or in 
different penicillin-binding proteins, which confer resistance to beta-lactams 
(Murakami et al. 1987; Sauvage et al. 2002; Munita and Arias 2016; Gifford et al. 
2018). Moreover, both modes of evolution can be accelerated by antibiotics. On one 
hand, fluoroquinolones can induce horizontal gene transfer by activating compe- 
tence in S. pneumoniae (Prudhomme et al. 2006; Slager et al. 2014), while on the 
other hand, the use of the same class of antibiotics can increase the mutation rate 
(Lindgren et al. 2003). Importantly, the maintenance of newly acquired resistance in 
a given population, and its dissemination among species, relies heavily on the 
associated fitness cost (Melnyk et al. 2015). For instance, the cost of metabolite 
production in a given reaction may constrain the evolution of antibiotic resistance, 
highlighting the role of bacterial metabolism and environment on antimicrobial 
adaptation (Zampieri et al. 2017). This cost may be different in strains with different
genetic backgrounds, suggesting that resistance maintenance depends on the bacterial
metabolic cost/status as well as, for instance, the bacterial transcriptional profile in
a particular environment (Cornick and Bentley 2012). These negative fitness costs
(i.e., reduced growth or replication rates) suggest that, in the absence of antibiotic
pressure, the adaptive mutations would disappear from the population. Nevertheless,
adapted populations rarely revert to their wild-type versions, and new mutations can
compensate for the fitness cost (Andersson and Hughes 2010).

Antibiotics usually target important cellular functions (e.g., cell wall synthesis, 
DNA replication, or protein synthesis), involving highly conserved and often essen- 
tial genes present in a wide range of bacteria (Hershberg et al. 2008) (Fig. 1). 
However, it has become clear that while antibiotics may have very specific targets, 
the bacterial response to antibiotics and the occurrence of resistance is much more 
distributed across the genome. For instance, we and others have assayed the antibi- 
otic response of bacteria through genetic perturbations (Fajardo et al. 2008; Tamae 
et al. 2008; Breidenstein et al. 2008; Schurek et al. 2008; Girgis et al. 2009; Nichols 
et al. 2011; van Opijnen and Camilli 2012; van Opijnen et al. 2016), which 
established that a large number of genes and pathways can influence drug suscep- 
tibility. These findings underline that we have a limited view of how an antibiotic 
inhibits a bacterial cell; instead of just a drug-target binary interaction, it is a 
complex, multifactorial process that begins with that interaction but propagates 
into various biochemical, metabolic, and regulatory processes of the cell (Tomasz 
1979; Vakulenko and Mobashery 2003; Floss and Yu 2005; Drlica et al. 2008; 
Chandrasekaran et al. 2016; van Opijnen et al. 2016). Thus, a bacterium’s resistance 
to an antibiotic partially stems from the genome-wide program that is triggered by 
that antibiotic. This means that small alterations to this program may establish the 
bacterium on the road to the development of resistance (Albert et al. 2005; El’Garch 
et al. 2007; Kohanski and Collins 2008; Kohanski et al. 2010; Baquero et al. 2011). 
So far it has largely been ignored that the genetic diversity present in a pangenome, 
and the often multiple trajectories that can lead to resistance, can result in strain- 
and/or species-specific resistance mechanisms with different fitness costs for
maintaining resistance mutations. We believe that all these factors have contributed
and are still contributing to the emergence of a diverse resistome (Davies and Davies
2010; Blair et al. 2015; Munita and Arias 2016) that only makes sense when viewed 
from a pangenomic context and which makes both the discovery and tracking of 
resistance as well as the treatment of resistant bacteria far more complex than 
previously thought. 


1.1 Species- and Strain-Specific Differences in Adaptation 
to Antibiotics 


The influence of the pangenome on the complexity underlying the evolution of 
resistance can be seen at both the between- and within-species level. For instance,
different mechanisms and mutations that trigger resistance to the same antibiotic
have mainly been found between species. Additionally, interactions between drug- 
resistance mutations and genetic backgrounds triggering differences in resistance 
levels have been found at both the between and within species level. In the following 
section, we discuss how the pangenome, or genetic differences between strains and 
species, affect the mechanism and/or the level of resistance that evolves. 


1.1.1 Species-Specific Resistance: There Is More Than One Way 
to Become Resistant 


Due to specific (pangenomic) genetic characteristics, different species can adopt 
different mechanisms of adaptation to become resistant to the same antibiotic. 
Additionally, different species can acquire different mutations or genes to achieve 
the same resistance mechanism to the same antibiotic. A well-documented example 
of the first scenario is beta-lactam-resistant mechanisms among Gram-negative and 
Gram-positive bacteria. Beta-lactam antibiotics inhibit bacterial cell wall synthesis by 
targeting penicillin-binding proteins (PBPs), a group of enzymes that are present in all 
bacterial species and which catalyze peptidoglycan cross-linking. PBPs interact with 
beta-lactams via an active site serine and form a relatively stable covalent complex 
(Sibold et al. 1994). The primary resistance mechanism against clinically important 
beta-lactams (e.g., penicillin, carbapenem, cephalosporin) is different between Gram- 
negative and Gram-positive bacteria. In Gram-negative bacteria, beta-lactam resis- 
tance is commonly driven by the acquisition of hydrolyzing beta-lactamases that 
inactivate the drug. In contrast, beta-lactam resistance in most Gram-positive species 
is mediated by target modifications, with the exception of staphylococcal penicillin- 
ase (Rosdahl 1985; Skov et al. 1995). For instance, beta-lactam-resistant Enterococ- 
cus faecium have acquired mutations in an essential PBP (PBP5) that reduce the 
accessibility of the active site and result in a low-affinity form (PBP5fm) (Sauvage 
et al. 2002). Similar low-affinity PBPs have also been reported in methicillin-resistant 
S. aureus (MRSA) (Murakami et al. 1987) and in beta-lactam-resistant strains of 
S. pneumoniae (encoded by mosaic genes acquired through HGT) (Sibold et al. 1994; 
Reichmann et al. 1996). It seems likely that this divergence in beta-lactam-resistant 
mechanisms between Gram-negative and Gram-positive bacteria arose from the 
differences in their cell envelopes (Munita and Arias 2016). In Gram-negative 
bacteria, the presence of an outer membrane and associated porins allows for the 
entry and accumulation of beta-lactams in the periplasmic space, prior to binding 
PBPs in the inner membrane (Fig. 2a). Such compartmentalization allows for beta- 
lactamase accumulation at sufficient concentrations and effective deconstruction of 
the beta-lactam molecules. 

While species-specific differences in antibiotic resistance can come from very 
different mechanisms, there are examples where the target is the same, but the 
manner in which it is targeted is different. For example, macrolides target the 
peptidyl site of nascent peptides in the large subunit of bacterial ribosomes, thereby 
inhibiting protein synthesis. Many cases of clinical macrolide resistance are caused 
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by mutations at specific nucleotide positions in the 23S rRNA. Due to differences in 
the copy number of the ribosomal RNA operon (rrna), different species have been 
shown to have different macrolide-resistant mutations in the 23S rRNA gene 
(Fig. 2b). Generally, mutations at A2058 or A2059 in the 23S rRNA (using E. coli
nucleotide sequence numbering) confer macrolide resistance in many pathogenic
bacteria, predominantly bacteria with one or two copies of the rrna operon, such as 
azithromycin-resistant Treponema pallidum (Stamm and Bergen 2000; Matejkova 
et al. 2009), clarithromycin-resistant Mycobacterium species (Meier et al. 1994; 
Nash and Inderlied 1995; Wallace et al. 1996), and Helicobacter pylori with 
resistance to macrolide-lincosamide-streptogramin B antibiotics (termed the
MLSB phenotype) (Wang and Taylor 1998). A2058 and/or A2059 mutations change
the structure of the drug-binding pocket and thereby reduce the binding affinity of
the drug, contributing to resistance. In bacteria with higher copy numbers of rrna,
such as Staphylococcus, Enterococcus, and Streptococcus, acquisition of point 
mutations on all or multiple copies of the 23S rRNA genes is highly improbable. 
Instead, macrolide resistance via 23S rRNA modification is frequently achieved by 
erm-methylation of target nucleotides. Erm genes are mobile genes that encode 23S 
rRNA methylases and can catalyze dimethylation of A2058 (Toh et al. 2007). In 
S. pneumoniae, ErmB provides a high level of resistance to erythromycin 
(MIC > 256 μg/mL) (Schroeder and Stephens 2016), which suggests that the resistance
level conferred by the same mutation is also dependent on the genetic background.
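
The copy-number argument above can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each rRNA operon copy acquires the resistance mutation independently and with the same small probability; the probability value and the copy numbers are hypothetical and are not taken from the studies cited above. Mechanisms such as gene conversion, which can spread a mutation between copies, are ignored.

```python
def prob_all_copies_mutated(per_copy_prob: float, copies: int) -> float:
    """Probability that every rRNA operon copy carries the mutation,
    assuming copies mutate independently (an illustrative simplification)."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return per_copy_prob ** copies


# Hypothetical per-copy mutation probability, for illustration only.
p = 1e-8
for n in (1, 2, 6):
    print(f"{n} copies: {prob_all_copies_mutated(p, n):.1e}")
```

Under these assumptions the probability falls off exponentially with copy number, which is consistent with the observation that species with many operon copies tend to rely on acquired methylases instead.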


1.1.2 Interactions Between Resistance Mutations and Genetic 
Background Can Affect the Level of Resistance 


While it may not come as a complete surprise that different species can adopt different
strategies to overcome antibiotics, recent studies have shown that when different
species or strains do have the same strategy to become resistant, the same mutation
does not automatically result in the same level of resistance. This can be caused by
differences in the genetic background and is a good example of how genetic differences
between species and strains can have important effects on (the emergence of)
resistance. Examples at the species level are loss-of-function mutations of the 16S
rRNA-specific methyltransferase GidB involved in streptomycin resistance 
(Okamoto et al. 2007; Koskiniemi et al. 2011). Streptomycin, an aminoglycoside 
antibiotic, binds to the 30S subunit of the ribosome and causes misreading of the 
correct tRNA. These mutations have been identified in low-to-intermediate levels of 
streptomycin resistance in multiple bacteria, such as M. tuberculosis, Mycobacterium 
smegmatis, S. aureus, and E. coli (Okamoto et al. 2007; Wong et al. 2011; Perdigao 
et al. 2014). Koskiniemi et al. (2011) showed that high-level streptomycin resistance 
caused by the loss of GidB is largely dependent on the presence of an aminoglycoside 
adenyltransferase (AadA) in the bacterium's genome, which is an enzyme that 
modifies and thereby inactivates aminoglycosides (Tait et al. 1985; Svab et al. 
1990; Magrini et al. 1998; Frank et al. 2003). In an experimental evolution study, 
streptomycin-adapted Salmonella typhimurium strains that have both the aadA gene 
and gidB mutations gained a higher level of streptomycin resistance than strains
having either one alone (Wistrand-Yuen et al. 2018). Species with aadA, e.g., 
S. typhimurium, can thereby gain a high level of streptomycin resistance, while 
species that lack this enzyme (e.g., E. coli, S. aureus, and M. tuberculosis) only 
obtain low-level resistance (Okamoto et al. 2007). 

Within a species, the same mutations may also not necessarily result in the same 
level of resistance. One such example is resistance in M. tuberculosis to isoniazid 
(INH). As a prodrug, INH must be processed by the mycobacterial enzyme KatG into 
its active form, isonicotinic acyl-NADH. The active drug then binds the enoyl-acyl 
carrier protein reductase InhA and blocks the synthesis of mycolic acid (Quémard 
et al. 1991). In M. tuberculosis, the primary INH-resistance mechanism is via a point 
mutation in KatG (e.g., S315T), which results in a partially active protein that reduces 
INH binding while retaining enough activity to support bacterial survival. Another 
frequently observed resistance mutation is in the promoter region of the target gene 
inhA (Lee et al. 2001). Strains that have inhA promoter mutations have been observed
to show different levels of INH resistance based on their phylogenetic lineages. 
M. tuberculosis is grouped in six main phylogenetic lineages (Hershberg et al. 
2008; Comas et al. 2010): three modern lineages that have evolved in regions with 
high-density populations and recent massive demographic expansion (i.e., lineage 4: 
Europe and America, lineage 3: India and East Africa, lineage 2: East Asia) and three 
ancient lineages from older and low-density populations (i.e., lineage 1: the Philippines,
lineage 5: Rim of the Indian Ocean, and lineage 6: West Africa) (Portevin et al.
2011). A study of 158 isolates of multidrug-resistant M. tuberculosis revealed that
mutations in the inhA promoter cause a high level of INH resistance (≥3.0 μg/mL) only
in the modern lineages 2 and 3, while these mutations cause low-level resistance
(MIC <3.0 μg/mL) mainly in ancient lineages 1 and 5 (Fenner et al. 2012). Although
M. tuberculosis harbors limited genetic diversity compared to other species, multiple 
studies have suggested that the variation in drug-resistant phenotypes of 
M. tuberculosis could be at least partially explained by epistatic interactions among 
the genetic background of different phylogenetic lineages, compensatory mutations 
and drug-resistance mutations (Gagneux et al. 2006; Fenner et al. 2012; Gygli et al. 
2017). 


1.1.3 The Ability of Evolving Antibiotic Resistance May Vary Across
Species Due to Epistatic Interactions and/or "Potentiator" Genes


Apart from epistatic interactions between genetic background and drug-resistance 
mutations, the presence of potentiator genes can make it possible for a novel trait to 
evolve that would otherwise be inaccessible (Blount et al. 2012; Lind et al. 2015). 
Depending on the genetic background, the presence of potentiators of antibiotic- 
resistance genes can prime strains to evolve resistance. To uncover the role of 
potentiators in different genetic backgrounds, Gifford and colleagues evolved eight 
strains in the Pseudomonas genus to the beta-lactam antibiotic ceftazidime and 
compared their pathways that led to resistance (Gifford et al. 2018). Their results 
show that Pseudomonas species that have the transcription factor ampR
(P. protegens and P. fluorescens) evolve ceftazidime resistance faster than species 
lacking this gene (P. mendocina or P. fulva) (Gifford et al. 2018). AmpR has been 
shown to increase the expression of beta-lactamase ampC upon the inactivation of 
peptidoglycan synthesis (Mark et al. 2011; Ropy et al. 2015). The authors hypoth- 
esized that ampR potentiates ceftazidime adaptation by allowing mutations in 
peptidoglycan biosynthesis genes such as ampD, mpl, and dacB. Indeed, in
P. aeruginosa, dacB inactivating mutations have only been observed in genetic 
backgrounds harboring ampR (Moya et al. 2009; Mark et al. 2011). These findings 
show that (clinically relevant) high-resistance level markers (e.g., mutations, genes 
acquired by HGT) should be considered and validated in different genetic back- 
grounds, and thus in a pangenomic context. 


1.2 Strain- and Species-Specific Phenotypic Stress Responses 
to Antibiotics 


Recent advances in high-throughput techniques involving mutant libraries as well as 
various omics approaches have allowed for unprecedented understanding of how 
bacteria respond to antibiotic-mediated stress. Such strategies have shown the 
diversity of antibiotic responses within species represented by a large pangenome 
as well as between species. Various examples discussed below show that antibiotics 
can induce stress throughout the bacterium both at the direct target of the antibiotic 
as well as at off-target pathways throughout the genome. Due to the pangenome and 
the consequent differences in genetic backgrounds, strains and species respond to 
antibiotics with (slightly) different sets of genes and thereby experience antibiotic 
stress in different ways. This means the selective pressures a bacterium experiences 
can be strain and/or species specific and drive the evolution of resistance in a strain- 
or species-specific manner. As a result, the pangenome not only affects the manner in 
which stress is experienced, but that same stress (e.g., antibiotics) also contributes to 
maintaining and expanding the pangenome. 


1.2.1 High-Throughput Tools for Investigating the Bacterial Response
to Stress 


With the rise of low-cost sequencing options, whole genome sequencing (WGS) has 
proved useful for identifying antibiotic-resistant bacteria by looking for the presence 
of certain genes (e.g., efflux pumps), insertion-deletions, and other polymorphisms 
associated with antibiotic resistance (Boissy et al. 2011; Zankari et al. 2012; Liu 
et al. 2014; McDermott et al. 2016; Zeng et al. 2018). The increased availability of 
large collections of bacterial whole genome sequences has allowed the identification 
of numerous single nucleotide polymorphisms (SNPs) associated with drug 
resistance through genome-wide association studies (Power et al. 2017). Resistance-associated
SNPs have been identified for a number of pathogenic bacteria including
M. tuberculosis (Desjardins et al. 2016), S. pneumoniae (Chewapreecha et al. 2014), 
and S. aureus (Alam et al. 2014). For antibiotic surveillance, the ability to identify 
features such as SNPs means WGS provides much more detailed information 
compared to traditional typing methods such as multilocus sequence typing (MLST).
This increased resolution can be used to predict antibiotic resistance for clinical 
isolates based on databases of known antibiotic-resistance determinants (Sandgren 
et al. 2009; McArthur et al. 2013; Stoesser et al. 2013; Walker et al. 2015; Lakin 
et al. 2017). Recent work has even demonstrated the ability to identify resistant 
strains as the sample is sequenced (Brinda et al. 2018) potentially leading to point- 
of-care devices which can guide appropriate use of antibiotics by clinicians. Never- 
theless, predictions of resistance are limited to the antibiotics that have been previ- 
ously tested (such as clinically important first- and second-line antibiotics), which 
hampers their utility in predicting bacterial responses to novel antibiotics. Moreover,
WGS provides a snapshot of the presence or absence of resistance determinants but 
cannot directly provide information on what genes or pathways are involved in 
responding to the stress induced by antibiotics. Consequently, while WGS and 
MLST are highly useful for resistance surveillance and may guide treatment options, 
they are more limited in their ability to tease apart phenotypic responses to antibi- 
otics for the purpose of understanding and potentially predicting how resistance 
develops. 
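
The database-driven prediction described above, and its main limitation, can be sketched in a few lines. The catalogue below is a hypothetical placeholder built from determinants mentioned in this chapter; it is not a curated resource, and the function and label names are assumptions made for illustration.

```python
# Hypothetical catalogue: determinant -> antibiotic it is associated with.
# Entries are illustrative placeholders, not a curated database.
KNOWN_DETERMINANTS = {
    "ermB": "erythromycin",
    "katG_S315T": "isoniazid",
    "23S_A2058_mutation": "azithromycin",
}


def predict_resistance(detected_features, antibiotic):
    """Return 'resistant' if a catalogued determinant for the drug is found,
    'no known determinant' if the drug is catalogued but nothing matched,
    and 'not assessable' if the catalogue has no entry for the drug."""
    catalogued_drugs = set(KNOWN_DETERMINANTS.values())
    if antibiotic not in catalogued_drugs:
        return "not assessable"
    hits = [f for f in detected_features
            if KNOWN_DETERMINANTS.get(f) == antibiotic]
    return "resistant" if hits else "no known determinant"


genome_features = ["ermB", "some_uncharacterized_gene"]
for drug in ("erythromycin", "isoniazid", "novel_drug_X"):
    print(drug, "->", predict_resistance(genome_features, drug))
```

The third call illustrates the point made in the text: for an antibiotic that has never been tested, a lookup of this kind has nothing to report.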

In contrast, the use of ordered mutant libraries can directly link genes to observed
phenotypes (Jacobs et al. 2003; Baba et al. 2006), which has allowed the detailed
characterization of how bacteria respond to various antibiotics (Nichols et al. 2011).
However, these libraries are time consuming to construct, which makes them
less amenable to use across a wide variety of bacteria. The advent of techniques such as
Tn-Seq (van Opijnen et al. 2009), INSeq (Goodman et al. 2009), HiTS (Gawronski
et al. 2009) and TRADIS (Langridge et al. 2009), and variants like RB-TnSeq
(Wetmore et al. 2015; Price et al. 2018) and droplet Tn-Seq (Thibault et al. 2019)
offers a high-throughput alternative which is easily adaptable. In general, all these
techniques rely on generating transposon-insertion libraries, which can be assayed 
by high-throughput sequencing for the relative frequency of mutants grown in a 
particular stress-inducing environment such as subinhibitory concentrations of anti- 
biotics. In this way, the phenotype of each genetic mutant can be determined, 
showing directly how bacteria respond to antibiotics and the genes that benefit or 
hinder the bacteria's ability to respond to this stress. Thanks to a diverse range of
transposon systems and the relative ease of creating mutant libraries, these tech- 
niques are amenable to a wide variety of bacterial species and individual strains, 
providing data within the context of the genetic background of each assayed strain. 
Characterization of the response to antibiotics can also be complemented by various 
"omic" approaches. These include transcriptomic (Jensen et al. 2017; Qin et al. 
2018), metabolomic (Zampieri et al. 2017b), and proteomic (Pérez-Llarena and Bou 
2016; Ma et al. 2017) analyses. The datasets generated by these techniques can also 
be overlaid with one another to provide a holistic understanding of how bacteria 
respond to antibiotic stress (Jensen et al. 2017). 
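
The core idea of these transposon-sequencing assays, comparing the relative frequency of each mutant with and without the stress, can be sketched as follows. This is a deliberately simplified log-ratio score, not the fitness calculation used by any of the published pipelines; the gene names, read counts, and pseudocount are hypothetical.

```python
import math


def relative_frequencies(counts):
    """Convert raw insertion read counts per gene into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads in library")
    return {gene: n / total for gene, n in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """Per-gene log2 ratio of mutant frequency with vs. without antibiotic.

    Negative values: mutants in that gene are depleted under the drug,
    i.e. the intact gene helps the cell cope with the stress.
    Positive values: disrupting the gene is beneficial under the drug.
    """
    genes = control.keys() & treated.keys()
    ctrl = relative_frequencies({g: control[g] + pseudocount for g in genes})
    trt = relative_frequencies({g: treated[g] + pseudocount for g in genes})
    return {g: math.log2(trt[g] / ctrl[g]) for g in genes}


# Hypothetical read counts for three genes (illustrative values only).
control_counts = {"geneA": 1000, "geneB": 1000, "geneC": 1000}
antibiotic_counts = {"geneA": 1000, "geneB": 100, "geneC": 1900}

scores = log2_fold_change(control_counts, antibiotic_counts)
for gene in sorted(scores):
    print(gene, round(scores[gene], 2))
```

In this toy example, geneB mutants are depleted under the drug and geneC mutants are enriched, which corresponds to the genes that "benefit or hinder" the response in the description above.
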

Studies utilizing Tn-Seq and related methods have shown that antibiotic-induced
stress involves the target of the antibiotic and also extends throughout the entire 
genome of the bacterium. For example, fluoroquinolones like ciprofloxacin, 
levofloxacin, and norfloxacin target topoisomerase IV and DNA gyrase, critical 
enzymes utilized in DNA synthesis. In the Gram-positive S. pneumoniae and the 
Gram-negative A. baumannii, Tn-Seq profiles for fluoroquinolones show that genes 
involved in DNA replication and repair such as recN and xseA are important for 
responding to these antibiotics. While these genes are not direct targets, the inhibi- 
tion of DNA replication by targeting gyrase and topoisomerase triggers DNA 
damage and thus explains the indirect importance of genes involved in DNA repair 
(van Opijnen and Camilli 2012; Geisinger et al. 2019). In addition, Tn-Seq profiles 
show a role for genes even beyond those related to DNA repair and replication and 
indicate the importance of genes with diverse functions including amino acid and 
carbohydrate metabolism. In P. aeruginosa, the response to the aminoglycoside tobramycin
also involves a diverse set of responsive genes, including those involved in cell
division, carbohydrate metabolism, and membrane metabolism (Gallagher et al. 
2011). Similar findings can be observed in data from E. coli where colony sizes 
were measured for an ordered mutant library grown in the presence of various 
stressors, including antibiotics (Nichols et al. 2011). For example, a screen with 
trimethoprim/sulfamethoxazole, which targets the folate biosynthesis pathway,
shows an important role for genes involved in this pathway, including mogA and
folM, as well as genes involved in nucleotide metabolism. But again, responsive 
genes also include those involved in carbohydrate metabolism, glycan biosynthesis, 
and membrane transport. These examples highlight that while stress may be felt 
acutely at the antibiotic’s target, it extends beyond the primary target and results in 
selective pressures acting throughout the genome. The importance of this is further 
confirmed by the observation that resistant clinical isolates often have mutations at 
sites throughout the genome that resolve such stress and/or work in a compensatory 
manner (Albert et al. 2005; El’Garch et al. 2007). Interestingly, targeting genes 
involved in off-target responses can create an opportunity for therapeutic interven- 
tion by generating synergy between the off-target gene/response and the 
assayed drug. 


1.2.3 Strain-Specific Responses to Antibiotic Stress 


In addition to showing that stress can reverberate throughout the genome, Tn-Seq is 
able to reveal how the genetic background of a strain affects the response to 
antibiotic stress. Several examples have shown that the genes and pathways involved 
in responding to antibiotic stress can be strain specific. For instance, S. pneumoniae 
strains TIGR4 and Taiwan-19F are similarly susceptible to daptomycin; however,
Tn-Seq results show that only 50% of the genes responding to daptomycin are
common to both strains, with the other 50% being strain specific (van Opijnen
et al. 2016) (Fig. 3). Moreover, the distribution of the functional categories of the 
responsive genes is significantly different between the two strains. This lack of 
conserved response is also observed for antibiotics representing fluoroquinolones,
aminoglycosides, and glycopeptides, with only 40-50% of the responsive genes 
conserved between these two strains for a particular antibiotic. Nevertheless, when 
the functional categories are combined into larger groupings corresponding to 
different domains of the cell’s physiology, there is no difference in the distribution 
between the two strains. This suggests that despite strain-specific differences in 
response at the gene level, the global response is more similar (van Opijnen et al. 
2016). 
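
A comparison of this kind reduces to set operations on the lists of responsive genes. The sketch below is a minimal illustration with hypothetical gene identifiers; the choice of denominator (here the union of both gene sets) is an assumption, and a different definition of "shared" would give different percentages.

```python
def response_overlap(strain_a_genes, strain_b_genes):
    """Summarize how many responsive genes two strains share."""
    a, b = set(strain_a_genes), set(strain_b_genes)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "fraction_shared": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical ortholog identifiers standing in for responsive genes.
strain_1 = {"g1", "g2", "g3", "g4"}
strain_2 = {"g3", "g4", "g5", "g6"}
print(response_overlap(strain_1, strain_2))
```

The same function could be applied after replacing gene identifiers with functional categories, or with the broader groupings mentioned above, to see whether the overlap increases at coarser levels of description.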

In Mycobacterium tuberculosis, in vitro Tn-Seq experiments have shown that 
several clinical strains have an increased requirement for the gene encoding KatG, 
compared to reference strain H37Rv (Carey et al. 2018). As discussed, KatG is an 
activator of the first-line M. tuberculosis antibiotic isoniazid, and adaptation exper- 
iments have shown that loss-of-function mutations in katG can result in isoniazid 
resistance. However, such mutations occur at a low frequency in clinical strains 
(Gagneux et al. 2006; Vilchèze and Jacobs 2014), which suggests that the increased
fitness cost of mutating katG in clinical strains decreases the frequency of acquisition 
of isoniazid mutants compared to H37Rv. Furthermore, Tn-Seq identified minimal 
fitness costs for losing glcB (a malate synthase involved in the glyoxylate shunt,
which is important for carbon and fatty acid metabolism) in some clinical strains,
whereas it is highly important in other strains (Carey et al. 2018). The authors
hypothesized that such differential requirements for glcB would result in correspondingly
differential responses to a novel inhibitor of this protein. Indeed, they found
that strains showing less of a requirement for glcB are less susceptible to the
inhibitor. This type of variability illustrates how the pangenome affects responses 
and consequently adaptive solutions to antibiotic stress and underscores why ther- 
apies may not produce consistent results across all strains. Furthermore, the finding 
that strains can demonstrate considerable variation in their response to antibiotics 
underscores how caution must be taken when evaluating studies that are based on a 
single strain and thereby ignore differential responses that may be present through- 
out the pangenome. 


Fig. 3 Strain-specific differences in responses to the same antibiotic. Networks show the relative
number of responsive genes of a given functional category responding to either amoxicillin (a) or
daptomycin (b) for S. pneumoniae strains TIGR4 and Taiwan 19F. The number of genes for each
group is shown in the charts on the right side. Note the diversity of functional categories beyond
the membrane target of both antibiotics. Each strain also responds to the antibiotics with slightly
different functional categories. While the strains appear to lack genetic and functional conservancy,
they do respond globally in the same way, when the functional categories are condensed
into categories involving the capsule, membrane, cellular control, and metabolism. (c) The
functional categories show a similar diversity of functions when responding to aminoglycosides,
glycopeptides, and fluoroquinolones for both TIGR4 and 19F


1.2.4 Gene Homology Frameworks to Uncover Differential Responses
Across Bacterial Species


In addition to considering strain-specific responses to the same antibiotic, we have 
assessed stress responses at the species level to determine how similar antibiotic 
response patterns are. By combining data from a variety of sources (Nichols et al.
2011; Murray et al. 2015) and generating two frameworks utilizing the OMA and
PATRIC databases (Wattam et al. 2017; Altenhoff et al. 2018), the responses of
E. coli, P. aeruginosa, A. baumannii, and S. pneumoniae to ciprofloxacin could be
compared. While not all responsive genes have homologs in all species, a consistent 
pattern is observed for these diverse species. Genes involved in DNA replication and 
repair such as recN and xseA are important in all four species and in both homology 
frameworks. Additional nonhomologous genes annotated as involved in DNA 
replication and repair are also observed in each of the four species. Each species 
also has responsive genes that are involved in various metabolic functions and 
cellular processes not related to DNA repair. Nevertheless, pairwise comparisons
indicate that only 5-10% of homologs not involved in DNA replication and repair
are shared between two or more species. This suggests that species may respond to
antibiotic stress similarly at the antibiotic target and related pathways, but individ- 
ually each species is responding with a unique program depending on its genetic 
background and thus coping with unique selective pressures that can influence the 
emergence of resistance. 
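
The logic of a homology framework can be sketched as a two-step procedure: translate each species' responsive genes into shared ortholog group identifiers, then count in how many species each group is responsive. The example below does not query OMA or PATRIC; the locus tags, group identifiers, and mappings are hypothetical, and only recN and xseA are taken from the text.

```python
from collections import defaultdict


def to_ortholog_groups(responsive_genes, gene_to_group):
    """Translate species-specific gene names into shared ortholog group IDs.
    Genes without a mapped group (no homolog in the framework) are skipped."""
    return {gene_to_group[g] for g in responsive_genes if g in gene_to_group}


def groups_by_species_count(responses, mappings):
    """Count, for each ortholog group, in how many species it is responsive."""
    counts = defaultdict(int)
    for species, genes in responses.items():
        for group in to_ortholog_groups(genes, mappings[species]):
            counts[group] += 1
    return dict(counts)


# Hypothetical locus tags and group IDs (illustration only).
mappings = {
    "species_1": {"s1_0001": "OG_recN", "s1_0002": "OG_xseA",
                  "s1_0003": "OG_metab1"},
    "species_2": {"s2_0101": "OG_recN", "s2_0102": "OG_xseA",
                  "s2_0103": "OG_metab2"},
}
responses = {
    "species_1": ["s1_0001", "s1_0002", "s1_0003", "s1_9999"],
    "species_2": ["s2_0101", "s2_0102", "s2_0103"],
}

counts = groups_by_species_count(responses, mappings)
conserved = sorted(g for g, n in counts.items() if n == len(responses))
print("responsive in all species:", conserved)
```

In this toy example the DNA repair groups are responsive in both species while the metabolic groups are species specific, mirroring the pattern described above.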


1.3 The Role of the Pangenome in Predicting Adaptation
to Antibiotic Stress 


We have so far discussed that mutations responsible for the resistance phenotype to a 
certain antibiotic can either be common or be specific to a strain or a species. 
Nevertheless, the types of adaptive-resistance mutations, and the order in which
they arise and fix in a population (adaptive trajectories), have been shown to be replicable
(Elena and Lenski 2003), which suggests adaptive evolution is, at least to a certain 
extent, constrained. In this section, we argue that the genetic background and the 
environmental context are two major factors that constrain adaptive evolution (e.g., 
during adaptation to antibiotics). We first discuss the role of genetic interactions and 
how they reduce the number of available adaptive trajectories, and propose a 
pangenome-wide view of studying genetic interactions. Next, we discuss the possi- 
bility of using how a selective pressure in the environment is sensed and experienced 
by an organism (environmental context) to predict where on the genome adaptive 
mutations will appear when the selective pressure is maintained. We argue that the 
predictions based on environmental context can be improved by the addition of 
pangenome-related information, such as the conservation of genetic sequences across 
many related organisms. 


1.3.1 Adaptive Evolution Is Replicable, Therefore Predictable 


The analysis of sequence sets on a pangenome scale allows associations to be made 
between genetic changes and antibiotic-resistant phenotypes. Such pangenome-wide 
association studies have revealed common sets of mutations that appear in organisms 
resistant to a certain antibiotic (Croucher et al. 2011; Mobegi et al. 2017a; Del
Barrio-Tofiño et al. 2017). Moreover, phylogenetic reconstructions suggest the same
resistance-causing mutations have appeared independently and multiple times in
geographically separated strains (Croucher et al. 2011; Farhat et al. 2013;
Chewapreecha et al. 2014). While such ad hoc associations have the power of
explaining the genetic basis of a certain phenotype, they rarely offer a predictive 
model for future adaptive trajectories. However, the observation that the same 
mutations have appeared in different pathogens independently suggests that adaptive 
evolution is not an entirely random process. This can be further seen in lab-directed 
adaptation experiments, where common sets of mutations keep reappearing in 
independent populations under the same selective pressure (Lang et al. 2013). 
These common adaptive trajectories demonstrate the replicability of adaptive evo- 
lution, which is not to say evolution is an entirely deterministic process. The 
emergence of new sequence variants is stochastic and phenomena such as 
hitchhiking genetic regions, genetic drift, and clonal interference can incorporate 
different degrees of randomness influencing which mutations will reach fixation and 
how (Lang et al. 2013). Yet, the replicability of adaptive evolution in antibiotic 
resistance suggests that there are a limited number of adaptive trajectories available 
to the adapting organism. In other words, while there are many possible ways a set of 
resistance mutations can reach fixation, the majority of those trajectories are not 
plausible because they are constrained by the environment and the genetic context. 
This means that if the environmental and genetic constraints a bacterium evolves 
under can be understood and/or (experimentally) captured, adaptive evolution 
should become predictable. 


1.3.2 Genetic Constraints on Adaptation 


In order to understand the genetic constraints on adaptation we need to consider 
epistatic interactions within the genome. Epistatic interactions are defined as the 
nonadditive effects of combinations of mutations. For instance, mutations can have 
different effects on fitness, depending on the genetic background of the organism, 
i.e., what other mutations are already present in the parental strain (Vogwill et al. 
2016). This is well illustrated by experiments that compared the fitness of single 
mutants with combinations of those singlets into double and triple mutants, where 
the fitness of the double and triple mutants differed from what is expected under the 
multiplicative model (i.e., in the absence of epistasis, the fitness of combining 
mutant A and mutant B in the same genome = fitness of A × fitness of B) (Weinreich
et al. 2006; Angst and Hall 2013; Hall and MacLean 2016). Moreover, epistasis 
influences the order in which mutations appear. If two mutations do not interact with
each other, then any order in which they appear is equally likely. However, when 
mutations do interact, the appearance of one may limit the appearance of the other. 
For instance, when the interactions of 5 mutations that confer resistance to 
cefotaxime were mapped out in E. coli, only 10 trajectories (out of 120 possible 
ones) turn out to have non-negligible probabilities of being observed (Weinreich 
et al. 2006). Another example of this is where lab adaptation of a beta-lactamase in 
E. coli is limited in its trajectory and will follow a certain path (that with the highest 
likelihood) when a specific initial mutation is present (Salverda et al. 2011). 
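
Both ideas in this section, the deviation from the multiplicative expectation and the pruning of mutational orders by epistasis, can be illustrated with a toy example. The sketch below uses a simplified criterion (a trajectory counts as accessible only if every step strictly increases fitness) rather than the probability-weighted analysis of the cited studies, and the three-mutation fitness values are hypothetical.

```python
from itertools import permutations
from math import factorial


def epistasis(w_a, w_b, w_ab):
    """Deviation of the double mutant from the multiplicative expectation.
    Zero means no epistasis; fitness values are relative to the ancestor (1.0)."""
    return w_ab - w_a * w_b


def accessible_trajectories(mutations, fitness):
    """Return the orders of acquiring `mutations` in which every step
    strictly increases fitness. `fitness` maps a frozenset of mutations
    to a fitness value."""
    paths = []
    for order in permutations(mutations):
        genotype = frozenset()
        ok = True
        for m in order:
            nxt = genotype | {m}
            if fitness[nxt] <= fitness[genotype]:
                ok = False
                break
            genotype = nxt
        if ok:
            paths.append(order)
    return paths


# Toy three-mutation landscape (hypothetical values): mutation "c" is only
# beneficial once "a" is present, an example of sign epistasis.
fitness = {
    frozenset(): 1.0,
    frozenset("a"): 1.2, frozenset("b"): 1.1, frozenset("c"): 0.9,
    frozenset("ab"): 1.3, frozenset("ac"): 1.4, frozenset("bc"): 1.0,
    frozenset("abc"): 1.6,
}

paths = accessible_trajectories("abc", fitness)
print(len(paths), "of", factorial(3), "orders are accessible")
print("orders possible for five mutations:", factorial(5))
```

Even in this small landscape, a single sign-epistatic interaction removes half of the possible orders, which gives an intuition for why only a small fraction of the 120 orders of five mutations may be realistically available.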

Since epistatic interactions limit the adaptive trajectories to a few likely ones, 
mapping out epistatic interactions can help determine which trajectories are most 
plausible, and thereby contribute to predictions of adaptive evolution. A high- 
throughput way of determining epistatic interactions is using genome-wide double- 
knockout screens, as has been done extensively in Saccharomyces cerevisiae (Tong 
et al. 2004), and Schizosaccharomyces pombe (Roguev et al. 2008). In these studies, 
synthetic lethality, which is an “extreme” form of epistasis, was used to build 
genome-wide epistasis or genetic interaction networks. These networks show the 
prevalence of epistatic interactions throughout the entire genome, with most genes 
interacting with at least one other gene, and a few hubs with numerous interactions. A
comparison of genetic interaction networks from S. cerevisiae (Tong et al. 2004) and
S. pombe (Roguev et al. 2008) demonstrates that the same interactions are not always
present, even when considering genes common to both organisms. In other words, 
while some interactions are conserved, others may be present or absent, depending on 
the genetic background. This means that the genetic interaction network of a single 
organism is not representative of a pangenome-wide genetic interaction network. 
Therefore, predictions of adaptation based on a single-strain network will be limited 
to that organism. 


1.3.3 A Pangenome-Wide View of Epistasis May Enhance Predictions 


Epistatic interactions are more than a collection of gene or locus pairs; rather, they
form a complex network that has both components that are universally true (those
interactions that are strain or species independent), and components that are only 
present in a certain strain or species. When epistatic interactions (on a gene level) are 
mapped for a single strain, the interactions are limited to the genes present in this one 
strain. However, the lack of a gene is not equivalent to the lack of the influence of 
that gene. The fact that the gene is absent may actually affect the fitness of strains 
with this particular genetic background. It is possible that genetic elements that vary 
considerably in their presence or absence across different strains interact 
epistatically. Such interactions have indeed been demonstrated between chromo- 
somal mutations and plasmids (Silva et al. 2011) or mobile elements (Stoebel and 
Dorman 2010), or even between two plasmids (San Millan et al. 2014). Therefore, 
studies mapping out genetic interactions in a single strain or species can be limiting, 
showing only the components of a network that applies to the organism being 
studied. In order to get a comprehensive view, it is necessary to construct a
pangenome-wide genetic interaction network. 

One possibility is to map out genetic interaction networks on hundreds of related 
strains/species. However, even with high-throughput screening methods, this 
approach is limited by time and cost. A more feasible alternative would be inferring 
genetic interactions through in silico analysis of a collection of genomic sequences. 
In a simple model, one can assume there is an underlying network of epistatic 
interactions, where each gene’s state (present or absent) influences the states of the 
genes it is interacting with [analogous to the Ising model describing particle spin 
states in statistical mechanics (Ising 1925)]. Each viable organism can then be 
described as a configuration—the presence and absence state of all genes in the 
pangenome. In this model, the underlying genetic interaction network results in 
some configurations being more likely than others. It is reasonable to assume that 
viable organisms are the more likely configurations. Based on this assumption, and 
considering the observed states of each gene from many genomes, it is possible to 
infer the underlying network connectivity between genes (Bresler 2014) and identify 
interactions between genes that are more likely to be universally true, and not strain/ 
species specific. Such a comprehensive genetic interaction network should give a 
much better idea about pangenome-wide constraints on adaptive evolution. In fact, 
the fitness landscape (a popular visual metaphor for the effect of genotypes on 
fitness) is a pangenomic concept. This long-standing visual lays out the possible
genotypes of an organism (or the existing genotypes in a pangenome) on a flat 
horizontal surface, and the fitness of each genetic variant is plotted on the vertical 
axis. Thus, because the fitness landscape considers many genomic variants at once, it 
inherently represents a pangenome view of fitness. The classical view of the fitness 
landscape is that there is a single peak of fitness, and an organism adapting under a 
selective pressure climbs this fitness peak as it accumulates mutations. However, 
increasing numbers of epistatic interactions result in the fitness landscape becoming 
decorated with peaks and valleys, forming a rugged surface (Kauffman and Wein- 
berger 1989). This apparent increase in complexity may also explain certain strain- 
specific adaptive outcomes, as it becomes clearer where local fitness maxima and 
minima are situated on the landscape. Consequently, the consideration of the 
pangenome (rather than single genomes) should uncover a comprehensive genetic 
interaction network. 
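
As a minimal sketch of the Ising-style picture described above (and not the inference method of Bresler 2014), the following Python snippet assigns a probability to every presence/absence configuration of a tiny pangenome. The three genes, the zero fields, and the two coupling values are hypothetical and chosen only for illustration; a positive coupling favors two genes sharing the same state, and a negative coupling favors opposite states.

```python
import itertools
import math

def config_energy(state, fields, couplings):
    """Ising-style energy; each state entry is +1 (gene present) or -1 (absent)."""
    energy = -sum(h * s for h, s in zip(fields, state))
    for (i, j), strength in couplings.items():
        energy -= strength * state[i] * state[j]
    return energy

def config_probabilities(fields, couplings):
    """Boltzmann probability of every configuration (exhaustive, small n only)."""
    states = list(itertools.product((-1, 1), repeat=len(fields)))
    weights = [math.exp(-config_energy(s, fields, couplings)) for s in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}

# Hypothetical toy network: genes 0 and 1 tend to co-occur,
# genes 1 and 2 tend to exclude each other.
fields = [0.0, 0.0, 0.0]
couplings = {(0, 1): 1.0, (1, 2): -1.0}
probs = config_probabilities(fields, couplings)

for state, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(state, round(p, 3))
```

In this toy setting, the configurations that respect both couplings receive the highest probability, which mirrors the assumption that viable organisms are the more likely configurations. Inferring the couplings from observed genomes is the inverse problem and is considerably harder than this forward calculation.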


1.3.4 Toward Predicting Adaptive Evolution and the Importance 
of Pangenomic Information 


The fitness landscape has long been considered a constant and rigid surface for each 
organism. However, genotype is not the only determinant of fitness—the same 
organism’s fitness varies in different environments. Thus, the fitness landscape is a 
much more fluid concept, and its shape/contour depends on environmentally deter- 
mined selective pressures. In other words, in addition to the genetic context 
determining fitness outcome and constraining adaptive evolution, environmental
context also plays an important role. 
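
To make the landscape picture concrete, the following toy sketch counts local fitness peaks over all presence/absence genotypes of two genes. The additive effects and the epistatic term are hypothetical numbers; the two parameter sets stand in for two environments, which is an assumption made purely for illustration.

```python
import itertools

def neighbors(genotype):
    """Genotypes that differ by gaining or losing exactly one gene."""
    for i in range(len(genotype)):
        yield genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]

def fitness(genotype, additive, epistasis):
    """Additive gene effects plus pairwise epistatic terms (1 = gene present)."""
    value = sum(a * g for a, g in zip(additive, genotype))
    for (i, j), effect in epistasis.items():
        value += effect * genotype[i] * genotype[j]
    return value

def local_peaks(additive, epistasis):
    """Genotypes that are fitter than all of their single-step neighbors."""
    peaks = []
    for g in itertools.product((0, 1), repeat=len(additive)):
        f = fitness(g, additive, epistasis)
        if all(f > fitness(nb, additive, epistasis) for nb in neighbors(g)):
            peaks.append(g)
    return peaks

# Hypothetical environment A: no epistasis, a single smooth peak.
peaks_env_a = local_peaks([1.0, 1.0], {})
# Hypothetical environment B: strong negative epistasis, a rugged landscape.
peaks_env_b = local_peaks([1.0, 1.0], {(0, 1): -3.0})
print(peaks_env_a, peaks_env_b)
```

Without epistasis the landscape has one peak (both genes present), whereas the epistatic term splits it into two separate peaks, so the adaptive outcome depends on the starting genotype as well as on the environment.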

Similar to how genetic interaction networks can reveal genetic constraints on 
adaptive evolution, multi-omic profiling reveals environmental constraints on adap- 
tive evolution. The manner in which a bacterium will adapt to the environment it 
finds itself in is linked to how a stress, i.e., a selective pressure (e.g., antibiotic), is 
sensed and processed by the bacterium (Zhu et al. 2018, 2019). The use of multi- 
omic profiling (e.g., via Tn-Seq, RNA-Seq) can reveal which genomic loci respond 
to and are important in overcoming stress in the environment. For instance, Tn-Seq 
experiments identify the genes that contribute to fitness (phenotypically important 
genes, or PIGs) under the stress, and RNA-Seq experiments reveal transcriptionally 
important genes (or TIGs) responding to the stress. A simple assumption would be 
that because PIGs and TIGs are relevant in the organism’s response to stress, they 
will also be implicated in resolving this stress over the course of adaptive evolution. 
In other words, genes that acquire adaptive mutations will be TIGs and/or PIGs. 
While in some cases, PIGs and/or TIGs acquire adaptive mutations, not all adapted 
genes are PIGs or TIGs (Fig. 4). This, along with the transcriptional and phenotypic 
responses often involving many different cellular functions, makes it challenging to 
find straightforward rules that predict which genes will adapt. This has motivated the 
use of machine learning algorithms to detect potentially multifactorial and compli- 
cated determinants of adaptive evolution (Zhu et al. 2018; Wang et al. 2018b). 
Moreover, where changes in expression and fitness are situated in a network can help 
inform which genetic changes may or may not be permissible. One can use regula- 
tory networks, protein-protein interaction networks or genome-scale metabolic 
models to contextualize the stress response. It turns out that with the inclusion of 
network features such as degree (how many connections a gene has) or
clustering coefficients (how many of a gene's neighbors are neighbors of each 
other), machine learning models can be used to predict in which genes adaptive 
mutations are most likely to occur (Zhu et al. 2018). Moreover, sequence conserva- 
tion and prevalence, which are features that can be extracted from the pangenome, 
and which describe how "plastic" (or variable) each gene is, improve prediction 
accuracy (Fig. 4c) (Zhu et al. 2018). While this is a step toward predicting the 
emergence of resistance before it actually occurs, for instance during treatment, 
incorporation of pangenome-wide genetic interaction networks will likely enhance
the predictive power and accuracy even further.
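
The two network features named above can be computed directly from an adjacency list. The sketch below is illustrative only: the four-gene network and the gene names are hypothetical, and it shows how the features are defined, not how Zhu et al. (2018) derived or used them.

```python
import itertools

def degree(adjacency, gene):
    """Number of genes directly connected to `gene`."""
    return len(adjacency[gene])

def clustering_coefficient(adjacency, gene):
    """Fraction of pairs of neighbors of `gene` that are connected to each other."""
    nbrs = adjacency[gene]
    k = len(nbrs)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in itertools.combinations(sorted(nbrs), 2)
                 if v in adjacency[u])
    return 2.0 * linked / (k * (k - 1))

# Hypothetical undirected network: A-B, A-C, A-D, B-C.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

# One feature row per gene; such rows could be combined with expression,
# fitness, conservation, and prevalence values before training a classifier.
features = {g: (degree(adjacency, g), clustering_coefficient(adjacency, g))
            for g in adjacency}
print(features)
```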


1.4 Developing New Therapeutics in the Light 
of the Pangenome 


There is an urgent need to develop new strategies to combat resistant pathogens. 
Both essential genes and genes required for virulence provide attractive targets for 
the development of new drugs or biologicals (Clatworthy et al. 2007; Juhas et al. 2011;
Mobegi et al. 2014). Such candidate targets have been identified for many
species by combining functional experimental analyses like Tn-Seq, RNA-Seq, or 
CRISPRi with computational predictive models (Mobegi et al. 2017b). However, it 
is becoming more and more apparent that a gene identified as essential or required 
for infection in one specific strain is not necessarily essential or required for growth 
or infection in a different genetic context (Rancati et al. 2018). As a consequence, 
pangenome variability must be taken into consideration when developing new 
therapeutics that work at a species-wide level. 


1.4.1 Targeting Essential Genes


A gene is essential if it is indispensable for reproductive success; in unicellular
organisms, these are the genes required for replication (Rancati et al. 2018). A
loss-of-function mutation in one of these genes, or a drug that inactivates its
function, will stop growth. That is why the identification of a pathogen's
essentialome (i.e., the set of essential genes in a defined genome or group of 
genomes) is an attractive approach for the identification of new drug targets. 
Currently, Tn-Seq and related techniques are probably the most popular experimental
tools used to determine essentialomes (Peng et al. 2017). Genes that lack insertions
in saturated transposon libraries selected in rich media are considered highly likely
to be essential in any given condition. CRISPRi is another technique that is rapidly
gaining popularity for determining gene essentiality in both prokaryotes and
eukaryotes (Peters et al. 2016; Liu et al. 2017; Wang et al. 2018a).
However, since many more strains and species exist than can efficiently and rapidly 
be experimentally screened for their essential genes, in silico predictive models of 
gene essentiality are receiving increasing interest (Mobegi et al. 2017b; Nigatu et al.
2017). Such models can be based on several types of data, including those obtained
from the genomic sequence of an organism (codon usage, orthology, GC content,
etc.) or from experimental data such as expression profiles or network topology
(Mobegi et al. 2017b; Nigatu et al. 2017). The accuracy of the latter models relies on
omics data obtained from species where functional genomics experiments could be
performed, whereas models based on sequencing are more suitable for poorly studied
organisms. Importantly, it is becoming clear that the static concept of gene
essentiality is no longer valid. Instead, essentiality is a context-dependent attribute
affected by both the environment and the genetic background of a bacterium (Rancati
et al. 2018). In the simplest case, a gene can be conditionally essential, meaning it is
essential in a specific environment but not in another, or in the case of a pathogen, a
gene can be essential in a specific body compartment but not in another.

Fig. 4 Prediction of adaptive evolution relies on pangenome features. (a) Circular plot of the
S. pneumoniae chromosome, with all features necessary for accurate prediction of which genes
will contribute adaptive mutations to vancomycin resistance. Importantly, there is no clear
association between any single dataset and the adaptive outcome; however, when taken as a whole,
all data types contribute to distinguishing adapted genes from non-adapted ones (see (c) and Zhu
et al. 2018). (b) Legend for (a). From innermost plot to outermost: Expression change: log2 fold
change in gene expression comparing vancomycin treatment to no antibiotic treatment after
20, 30, 45, 60, and 90 min of antibiotic exposure. Sequence conservation: -log10 Smith-
Waterman distance across all pairs of homologous sequences. Sequence prevalence: percentage
of strains in the S. pneumoniae pangenome that have a homolog of the gene. Essential genes:
genes necessary for survival, as determined by Tn-Seq. Fitness change: change in fitness
comparing vancomycin treatment to a no antibiotic control as determined by Tn-Seq. Mutation
frequency: frequency of each mutation in a population adapted to vancomycin. Adapted gene:
gene containing at least one mutation that is fixed at high frequency, and is specific to the
vancomycin adapted populations. (c) Classification of adapted genes and non-adapted genes.
Receiver operating characteristic curve for a support vector machine trained with all data from
(a, b) (blue) and one trained with pangenome sequence conservation and sequence prevalence
omitted (red). The inclusion of these two pangenomic features improves the performance of the
classifier
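
The Tn-Seq criterion described in this section (genes lacking insertions in a saturated library are candidate essentials) can be sketched in a few lines. This is a deliberately simplified illustration: the gene names, counts, and the `min_sites` cutoff are hypothetical, and real analyses use statistical models that account for library saturation and gene length.

```python
def candidate_essential_genes(insertion_counts, site_counts, min_sites=10):
    """Genes with no observed insertions despite enough potential insertion sites.

    insertion_counts: gene -> number of distinct insertions observed
    site_counts:      gene -> number of potential insertion sites in the gene
    min_sites:        hypothetical cutoff; short genes may lack insertions by chance
    """
    return sorted(
        gene for gene, n_sites in site_counts.items()
        if n_sites >= min_sites and insertion_counts.get(gene, 0) == 0
    )

# Hypothetical library summary.
site_counts = {"geneA": 40, "geneB": 35, "geneC": 4}
insertion_counts = {"geneA": 0, "geneB": 22, "geneC": 0}

candidates = candidate_essential_genes(insertion_counts, site_counts)
print(candidates)
```

Here `geneC` is not called, because with only four potential sites the absence of insertions is weak evidence. Repeating the call on libraries selected in different conditions would give a first view of conditional essentiality.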

To understand how genetic context affects gene essentiality it is important to 
consider the network structure of a genome. Genes in a genome do not act as isolated 
units, but they interact with each other forming a network. The connections that 
shape this network can represent protein-protein interactions, epistatic relationships,
or transcriptional regulatory interactions (Babu and Madan Babu 2008; Wuchty and 
Uetz 2014; Costanzo et al. 2016). Some genes present a high degree, i.e. a high 
number of interactions connecting them to other genes, while other genes are poorly 
connected (low degree). Essential genes have been shown to have a higher degree 
than nonessential ones in these genetic interaction networks (Jeong et al. 2001; 
Davierwala et al. 2005; Costanzo et al. 2010, 2016; Kim et al. 2012; Jiang et al. 
2015), which is a characteristic that has been used to predict gene essentiality (Shim 
et al. 2017). Interestingly, data from different yeast strains has shown that essential 
genes may be split up into those that are always essential (their loss cannot be 
overcome), and those that are essential depending on the genetic background. The 
loss of essential genes from this latter category can be compensated by the adaptive 
evolution of alternative cellular processes; such essential genes are thereby referred 
to as "evolvable" essential genes (Motter et al. 2008; Liu et al. 2015). As an example 
that this is not limited to yeast, the proteins MreC and MreD, involved in peripheral 
peptidoglycan synthesis, are essential for some S. pneumoniae strains. However, 
different mutations, including the inactivation of the pbp1a gene, can suppress the
essentiality of these proteins (Land and Winkler 2011), which classifies mreC and
mreD as “evolvable” essential genes. This evolvability thus at least partially explains
how the essentiality of genes can depend on the genetic background and underscores
that it is important to determine a pathogen's essentialome at a species-wide level to
enable the identification of pangenome-wide drug targets. In general, broad- 
spectrum antibiotics work against large groups of different species of bacteria, and 
thus existing drugs often target the "pangenome." Interestingly, these new 
pangenome concepts are creating opportunities to develop drugs that are directed 
at a specific clade. The mevalonate pathway is an example of an essential function 
against which clade-targeting drugs have been developed. This pathway is involved 
in the production of isoprenoids, and has been shown to be essential in different 
Gram-positive bacteria (Wilding et al. 2000; Balibar et al. 2009). The pathway is 
inhibited by an intermediate product, diphosphomevalonate, and fluorinated 
derivatives of this compound have shown potent antibacterial activities (Kang et al.
2015). However, while the mevalonate pathway is essential, it has also been shown 
to be evolvable in S. aureus (Reichert et al. 2018) raising the possibility that 
resistance mechanisms can easily arise. Importantly, genomic comparison of differ- 
ent Staphylococci has shown that species either have the mevalonate or the 
non-mevalonate (or 2C-methyl-D-erythritol-4-phosphate, MEP) pathway for the
biosynthesis of isoprenoids, and specific pathogenic Staphylococci of domestic 
animals have the non-mevalonate pathway (Misic et al. 2016). Based on this 
difference at the genus level, it has been proposed that antibiotics for domestic 
animal Staphylococci targeting the MEP pathway could avoid the emergence of 
antibiotic resistance determinants in human pathogens. Such clade-targeting antibiotics
may thus be an interesting strategy, but they are only possible if a comprehensive
understanding of the pangenome is available.
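
A species-wide essentialome can be summarized by separating genes that are essential in every screened strain from those that are essential in only some genetic backgrounds. The sketch below assumes per-strain essential gene sets are already available; the strain and gene names are hypothetical.

```python
def split_essential_genes(essential_by_strain):
    """Split essential genes into always essential vs. background dependent.

    essential_by_strain: strain -> set of genes called essential in that strain.
    A gene that is absent from a strain is treated as not essential there.
    """
    gene_sets = list(essential_by_strain.values())
    always = set.intersection(*gene_sets)
    background_dependent = set.union(*gene_sets) - always
    return always, background_dependent

# Hypothetical essentiality calls for three strains.
essential_by_strain = {
    "strain1": {"dnaA", "ftsZ", "mreC"},
    "strain2": {"dnaA", "ftsZ"},
    "strain3": {"dnaA", "ftsZ", "mreC", "geneX"},
}

always, background_dependent = split_essential_genes(essential_by_strain)
print(sorted(always), sorted(background_dependent))
```

Genes in the first set are candidates for species-wide targets, while genes in the second set may point to clade-specific targets or to "evolvable" essential genes; distinguishing those two cases requires additional experiments.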


1.4.2 Targeting Mechanisms of Infection


In addition to genes essential for general growth, genes required for colonization, 
infection and/or those that damage the host (i.e., virulence factors) are also attractive 
targets for drug therapies (Clatworthy et al. 2007; Rasko and Sperandio 2010; Allen 
et al. 2014; Dickey et al. 2017). Because such factors are mainly needed during
infection, selection for resistance against compounds targeting them (antivirulence
drugs) is expected to act largely within the host; consequently, resistance mechanisms
may not easily spread outside the host (Allen et al. 2014). Also, antivirulence drugs may be more effective
against persisters (Kim et al. 2018), and since they are directed at very specific 
targets they could potentially have less of an effect on the natural microbiota of the 
host (Clatworthy et al. 2007; Dickey et al. 2017). Specific antivirulence drugs or 
biologicals at different stages of clinical development target pathways including the
production of teichoic acids, biofilm formation, quorum-sensing mechanisms, and 
specific histidine kinases, and are directed against bacteria including the ESKAPE 
pathogens (Matano et al. 2016; Pasquina et al. 2016; Goswami et al. 2017; Dickey 
et al. 2017; Cardona et al. 2018; Huggins et al. 2018). To expand such specific 
therapeutic options, it is necessary to identify a pathogen's genetic requirements for 
infection, for which in vivo Tn-Seq experiments have proven successful (van 
Opijnen and Camilli 2012; de Vries et al. 2017; Le Breton et al. 2017; Shields 
et al. 2018). As with essential genes, requirements for certain genes seem to be 
environment dependent. For example, proline biosynthetic genes in S. pneumoniae 
strain TIGR4 have been shown to be required for infecting mouse lungs, but are 
dispensable for colonizing the nasopharynx (van Opijnen and Camilli 2012). Other 
environmental factors that affect genetic requirements are microbial communities 
and polymicrobial infections. For instance, S. aureus requires 182 genes for a 
successful infection when co-inoculated with P. aeruginosa, but the same genes 
are dispensable if the pathogen is inoculated by itself (Ibberson et al. 2017). By using 
a Tn-Seq approach, it was shown that two different strains of P. aeruginosa required 
different genes to grow in cystic fibrosis sputum, a growth condition that partially 
mimics an in vivo infection (Turner et al. 2015). Moreover, many of the genes
required by S. pneumoniae strain TIGR4 for host colonization are not present in the
genomes of other clinical isolates of the species (Fig. 5), which underscores that
virulence determinants are indeed also dependent on genetic background and thus
only make sense in the context of the pangenome. A successful example of considering
the pangenome to develop an antibacterial therapy is the pneumococcal
vaccine (Berical et al. 2016; Brooks and Mias 2018). The S. pneumoniae capsule is
one of its most important virulence factors and its diversity is high, with over
90 types (serotypes) currently described (Geno et al. 2015). Capsules are highly
antigenic and serotypes differ in polysaccharide residue composition, chemical
decoration of sugar monomers and length of the polysaccharide chain (Bentley
et al. 2006). Pneumococcal vaccines are formulated by mixing multiple capsule
serotypes, as exemplified by the pneumococcal conjugate vaccine 13 (PCV13)
and the pneumococcal polysaccharide vaccine (PPSV23). These vaccines protect
against strains of 13 and 23 different capsule types, respectively (Berical et al.
2016; Brooks and Mias 2018), and are thereby highly successful in targeting a
considerable part of the S. pneumoniae pangenome.

Fig. 5 Genetic requirements of S. pneumoniae strain TIGR4 and their comparison with the species
pangenome. Tn-Seq experiments performed in strain TIGR4 (red arrowhead) determined candidate
essential genes, genes required for growth in minimal medium and those required for infection (van
Opijnen and Camilli 2012). The presence (yellow) and absence (blue) of these genes was
established in 332 other S. pneumoniae strains. Many genes identified as essential or required
for infection in TIGR4 are absent in many other invasive disease isolates
(Cremers et al. 2015)
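
The comparison summarized in Fig. 5 amounts to asking, for each gene required in a reference strain, how prevalent that gene is across the rest of the species. A minimal sketch, assuming gene presence has already been determined per strain (all names, the tiny strain panel, and the 0.95 cutoff are hypothetical):

```python
def gene_prevalence(gene, genes_by_strain):
    """Fraction of strains whose genome contains `gene`."""
    carriers = sum(1 for genes in genes_by_strain.values() if gene in genes)
    return carriers / len(genes_by_strain)

def poorly_conserved_requirements(required_genes, genes_by_strain, min_prevalence=0.95):
    """Required genes of the reference strain that are missing from many strains."""
    return sorted(
        gene for gene in required_genes
        if gene_prevalence(gene, genes_by_strain) < min_prevalence
    )

# Hypothetical gene content of four other isolates.
genes_by_strain = {
    "isolate1": {"geneA", "geneB", "geneC"},
    "isolate2": {"geneA", "geneC"},
    "isolate3": {"geneA", "geneB"},
    "isolate4": {"geneA"},
}
# Hypothetical genes found to be required for infection in the reference strain.
required_in_reference = {"geneA", "geneB", "geneC"}

flagged = poorly_conserved_requirements(required_in_reference, genes_by_strain)
print(flagged)
```

Genes flagged in this way would be poor species-wide targets, even though they are required in the reference strain.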
1.4.3 Antimicrobial Combination Therapy


In addition to developing novel therapies, utilizing currently available drugs in a 
more effective way, e.g., through a multidrug strategy (including antibiotic cycling 
and antimicrobial combination therapy), potentially provides enhanced ways to treat 
clinical infections and prevent resistance (Smirnova et al. 2011; Yoshida et al. 2017; 
Firsov et al. 2017). However, it has been shown that the responses to drug-drug
combinations can be species specific (even among phylogenetically related
organisms), and in some cases strain specific (Brochado et al. 2018). Thus, the
application of combination therapies presents a challenge with respect to the
pangenome. To overcome this challenge, it is necessary to get a comprehensive
understanding of drug-drug interaction outcomes in many species and strains of a
species, potentially by testing all possible combinations of drugs, and at different
concentrations. Brochado et al. tested 2883 pairwise drug-drug combinations on six
bacterial strains from three Gram-negative bacterial species (E. coli, S. typhimurium
and P. aeruginosa), yielding a total of 17,050 combinations. The authors found that
70% of the detected drug-drug interactions are species specific, and that 13-30% are
strain specific, with different interaction outcomes among the strains (Brochado et al.
2018). Although approaches like these are very important, they can be very time
consuming and expensive to perform, especially when one considers hundreds of
species/strains in a pangenome. This has prompted the application of computational
predictive strategies. One such approach is INDIGO, through which the developers
were able to identify a group of genes that are predictive of antibiotic interactions in
E. coli, and to use these genes to predict drug interaction outcomes in other important
pathogens including M. tuberculosis and S. aureus (Chandrasekaran et al. 2016).
Using such predictive modeling methods considerably reduces the number of
experiments that need to be performed, potentially making it possible to accurately
infer drug-drug interaction outcomes on a pangenome scale.
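
To illustrate what a strain-specific interaction outcome means in practice, the sketch below scores a drug pair with the Bliss independence model, a commonly used null model for drug combinations. Its use here is an assumption for illustration, and the fitness values and the 0.1 threshold are hypothetical.

```python
def bliss_score(fitness_a, fitness_b, fitness_ab):
    """Observed minus expected combined fitness under Bliss independence.

    Fitness values are relative growth (1.0 = untreated). Under independence,
    the expected fitness with both drugs is the product of the single-drug values.
    """
    return fitness_ab - fitness_a * fitness_b

def classify_interaction(fitness_a, fitness_b, fitness_ab, threshold=0.1):
    """Label a drug pair as synergy, antagonism, or no interaction."""
    score = bliss_score(fitness_a, fitness_b, fitness_ab)
    if score < -threshold:
        return "synergy"
    if score > threshold:
        return "antagonism"
    return "no interaction"

# Hypothetical measurements of the same drug pair in two strains of one species.
outcome_strain1 = classify_interaction(0.8, 0.5, 0.2)   # grows worse than expected
outcome_strain2 = classify_interaction(0.8, 0.5, 0.6)   # grows better than expected
print(outcome_strain1, outcome_strain2)
```

In this toy example the same drug pair is synergistic in one strain and antagonistic in the other, which is the kind of strain-specific outcome that makes pangenome-scale testing or prediction necessary.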


2 Conclusions 


The development of antibiotic resistance is a complex process that can involve 
multiple modes of adaptation and/or multiple sources of selective pressure. Here 
we argue that a fuller understanding of this process is only possible by viewing it 
through the lens of the pangenome. We have highlighted recent work that demon- 
strates that genetic background plays an important role in how bacteria respond to an 
antibiotic and how they develop resistance. We have explained how species and 
strains with different genetic backgrounds may exhibit (slightly) different adapta- 
tional outcomes in response to antibiotics. Strains within a pangenome may also 
exhibit strain-specific differences in their mechanism and level of resistance as well 
as their ability to evolve resistance. These different outcomes can be put into context 
and partially explained by how antibiotic stress is experienced and processed in 
strain- and species-specific ways. In this way, antibiotics can contribute to the
maintenance and shaping of a pangenome by driving adaptive evolution in strain- 
specific ways. 

In addition to providing context for understanding strain- and species-specific 
responses to antibiotics and their development of resistance, the pangenome can 
provide a means of predicting the development of resistance as well as inform the 
development of novel therapeutics. We argue that adaptation to sustained antibiotic 
pressure is not a wholly stochastic process but rather constrained by a strain’s genetic 
background as well as its environmental context. Given these constraints, it is 
increasingly possible to utilize machine learning algorithms to make predictions on 
the probability that a bacterium will evolve resistance. These algorithms can utilize 
multiple layers of data including genomic, transcriptional, and metabolic datasets at 
the pangenome level. Therefore, they will continue to improve as additional datasets 
are generated. Finally, we have considered the role the pangenome could play in 
developing new therapeutics to combat resistant pathogens. Essential genes and 
virulence genes offer attractive targets for developing novel therapeutics; however, 
these targets must be considered within the context of the pangenome due to a variety 
of reasons. Essential genes in one strain may, in fact, be evolvable, while they are 
static in another strain. Virulence targets may also be strain specific or dependent on 
the environmental context of infection. While this may limit the number of targets that 
are present throughout the pangenome, it does offer the possibility of identifying 
targets that are strain or species specific. 
Part III
Pangenomics: An Open, Evolving Discipline 


Meta-Pangenome: At the Crossroad
of Pangenomics and Metagenomics


Bing Ma, Michael France, and Jacques Ravel 


Abstract With recent technological advances in cultivation-independent
high-throughput sequencing, metagenomics has tremendously improved our ability to
characterize the genomic content of whole microbial communities. In this chapter, we
argue that the notion of the pangenome can be applied beyond the available genome
sequences by leveraging metagenome-assembled genomes to form a comprehensive
representation of the genetic content of a taxonomic group in a particular
environment. We present the concept of the meta-pangenome, a representation of the
totality of genes belonging to a species identified in multiple metagenomic samplings
of a particular habitat. Because high-quality genomes are an essential component of
genome-centric pangenome analyses, we emphasize the importance of performing
stringent quality assessment and validation of metagenome-deconvoluted genomes.
This expansion from the traditional pangenome concept to the meta-pangenome
overcomes many of the biases associated with whole-genome sequencing, and
addresses the in vivo ecological context to further develop a systems-level
understanding of microbial ecosystems.


Keywords Meta-pangenome · Pan-metagenome · Pangenome · Metagenome ·
Comparative genomics · Metagenome-assembled genome · Intraspecies diversity ·
Metagenomic subspecies · Community ecotype · Habitome


1 Introduction 


The first microbial genome, Haemophilus influenzae, was sequenced in 1995 
(Fleischmann et al. 1995) with the second, Mycoplasma genitalium, following a 
few months later (Fraser et al. 1995). In analyzing the M. genitalium genome, the 
authors compared its sequence to that of H. influenzae, the only other available 
genome sequence at the time, providing insights into the ecology and evolution of
these two microbes. Every subsequent genome comparison enabled the identification
of shared and unique genetic characteristics between sets of organisms. From this
observation emerged the concept of pangenome, which describes the core (genes 
present in every strain of the species) and accessory (genes present in a subset of 
strains) genomes. Studying the similarities and differences between the genomic
content of organisms can inform their evolutionary relationships, ecological roles,
and relationship to health, and has revolutionized our understanding of microbial
diversity (Touchman 2010; Xia 2013; Hardison 2003; Miller et al. 2004; France et al.
2016).

Over the years, and with significant technological advancement, the number of 
available genome sequences has expanded from a few to a seemingly endless 
catalog. Yet this impressive collection suffers from a rather severe bias toward 
species and strains that are related to human health, amenable to isolation, and/or 
generally tractable. Metagenomics, the sequencing of whole microbial communities, 
is filling in these gaps by characterizing the genomes of entire populations in a 
community without cultivation. In this chapter, we argue that the notion of the pangenome
can be applied beyond the available genome sequences by leveraging metagenome- 
assembled genomes (MAGs), to form a comprehensive representation of the genetic 
content of a taxonomic group in a particular environment. We present the concept of 
the meta-pangenome, a representation of the totality of genes belonging to a species 
identified in multiple metagenomic samplings of a particular habitat. This expansion 
from the traditional pangenome concept to the meta-pangenome overcomes many of 
the biases associated with whole-genome sequencing and addresses the in vivo 
ecological context by describing the whole genetic potential of a species in a specific 
environment. Further building on this new concept, one can think of the 
pan-metagenome as the complete gene/protein catalog of all species found in a
given environment.
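
As a minimal sketch of the meta-pangenome idea, the snippet below pools the gene content of MAGs assigned to one species across several metagenomic samples and derives core and accessory gene sets. The sample names and gene identifiers are hypothetical, and the strict definition of "core" used here is a simplifying assumption.

```python
def meta_pangenome(genes_by_mag):
    """Pan, core, and accessory gene sets for MAGs of one species.

    genes_by_mag: sample or MAG identifier -> set of gene families in that MAG.
    """
    gene_sets = list(genes_by_mag.values())
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets) if gene_sets else set()
    return {"pan": pan, "core": core, "accessory": pan - core}

# Hypothetical MAGs of the same species recovered from three samples of one habitat.
genes_by_mag = {
    "sample1": {"g1", "g2", "g3"},
    "sample2": {"g1", "g2", "g4"},
    "sample3": {"g1", "g2", "g3", "g5"},
}

result = meta_pangenome(genes_by_mag)
print(sorted(result["core"]), sorted(result["accessory"]))
```

Because MAGs are often incomplete, requiring presence in every MAG will tend to underestimate the core; a relaxed cutoff (for example, presence in most MAGs) is a natural variation on this sketch.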


2 Metagenome Deconvolution Enables Genome-Centric 
Analyses of Microbial Ecosystems 


An overwhelming majority of microbial species have resisted cultivation in the 
laboratory, largely due to strict, yet unknown, growth requirements (Bakken 1985). 
The cultivation of fastidious microbes requires optimal combinations of nutrients, 
growth temperatures, oxygen levels and, in some cases, even the presence of key
microbial partners (Amann et al. 1995; Eckburg et al. 2005). The inability to grow 
these organisms has undoubtedly limited our understanding of the ecology of indig- 
enous microbial communities. State-of-the-art whole community sequencing tech- 
nology via metagenomics has opened the door to in vivo studies of microbial 
populations and communities. By definition, metagenomic sequencing characterizes 
the collection of all the genetic material isolated from an environmental sample 
without traditional cultivation (Handelsman 2004; Iverson et al. 2012; Mackelprang
et al. 2011). This has aided the development of systems-level insights into the 
structure and function of microbial ecosystems (Handelsman 2004; Gilbert and 
Dupont 2011). Advancements in sequencing technologies and throughput have improved,
and continue to improve, our ability to characterize the genomic contents of microbial
communities down to the rare biosphere (Eckburg et al. 2005; Sogin et al. 2006).

Metagenomic sequencing results in a dataset of sequence reads that belong to the 
various species that make up the microbial community. Assembly of these datasets 
into stretches of contiguous DNA sequences, termed contigs, can be complicated by 
the presence of conserved genomic regions across species. The development of
metagenome-specific short-read assembly algorithms and tools that can disentangle
these similar sequences originating from different taxa has improved the quality of
metagenomic assemblies (Pevzner et al. 2001). These tools include IDBA-UD (Peng et al.
2012), MetaVelvet (Namiki et al. 2012), SOAPdenovo (Li et al. 2010; Luo et al.
2012), ABySS (Simpson et al. 2009), khmer (Pell et al. 2012; Howe et al. 2012),
Ray Meta (Boisvert et al. 2012), MEGAHIT (Li et al. 2015, 2016), and metaSPAdes
(Nurk et al. 2017). Binning of these contigs based on genomic characteristics such as
GC content, tetramer frequency, and sequence coverage has enabled researchers to
identify sets of contigs that belong to the same species. These advancements have
resulted in the concept of metagenome-assembled genomes (MAGs), which represent the
collection of all contigs or scaffolds from a single strain or from closely related
strains of a given species. Developments in bioinformatics tools used
in assembly and binning have made the recovery of genomes from metagenomic 
datasets a routine analysis, including rare species and draft genomes from previously 
uncultivated species (Albertsen et al. 2013). Binning algorithms and tools have been 
reviewed previously (Sangwan et al. 2016; Breitwieser et al. 2017). For each species, 
the genetic contents of all strains in the population are included in a species bin, 
although sequencing depth, library construction methods, presence of host DNAs, 
and other factors may affect the metagenomic sequencing results (Zaheer et al. 2018; 
Pereira-Marques et al. 2019; Bowers et al. 2015). 
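
Two of the sequence characteristics used for binning, GC content and tetramer frequency, can be computed per contig as sketched below. This is an illustration of the features only, not the algorithm of any particular binning tool; the example contig is hypothetical, and practical tools typically also merge reverse-complementary tetramers and combine these features with coverage.

```python
from collections import Counter
from itertools import product

def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)

def tetramer_frequencies(seq):
    """Relative frequency of each of the 256 A/C/G/T tetramers (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[k] for k in kmers)  # windows containing other symbols are skipped
    return {k: (counts[k] / total if total else 0.0) for k in kmers}

# Hypothetical contig; real contigs are thousands of bases long.
contig = "ATGCGCGCATATGCGC"
profile = tetramer_frequencies(contig)
print(round(gc_content(contig), 3), len(profile))
```

Contigs from the same genome tend to have similar feature vectors, so clustering these vectors (together with coverage across samples) is one way to group contigs into species bins.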

MAGs have led to the discovery of a remarkable amount of genomic diversity 
and the characterization of novel bacterial membership. However, MAGs should 
always be used with caution for the reasons discussed above. False positives in
binning, as well as conflicted and incomplete MAGs, have been observed for a variety
of binning tools and can reduce the quality of public genome repositories if
MAGs are not evaluated carefully (Shaiber and Eren 2019). Multiple studies have
suggested that downstream MAG quality assessment and validation steps are
critical; recently published tools that serve this purpose include
MetaQUAST (Koren and Phillippy 2015), CheckM (Parks et al. 2015), MAGpy
(Stewart et al. 2019), anvi'o (Eren et al. 2015), AMBER (Meyer et al. 2018), and
DAS Tool (Sieber et al. 2018). Further refinement, stringent quality assessment,
extending assembly length through re-assembly after recruiting reads back to the 
MAGs, and genome completeness assessments are important and necessary steps to 
ensure the fidelity of the MAGs (Eren et al. 2015). High-quality metagenome- 
deconvoluted genomes are essential to perform genome-centric in vivo analyses of 
microbial ecosystems. 
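
One widely used idea behind MAG quality assessment is to count single-copy marker genes: missing markers suggest an incomplete genome, and duplicated markers suggest contamination from another genome. The sketch below is a simplified illustration of that idea and does not reproduce the procedure of CheckM or any other tool; the marker names are hypothetical.

```python
from collections import Counter

def marker_based_quality(markers_found, expected_markers):
    """Rough completeness and contamination estimates for one MAG.

    markers_found:    list of marker genes detected in the MAG (with repeats)
    expected_markers: set of single-copy markers expected in a complete genome
    """
    counts = Counter(m for m in markers_found if m in expected_markers)
    n_expected = len(expected_markers)
    present = sum(1 for m in expected_markers if counts[m] > 0)
    extra_copies = sum(counts[m] - 1 for m in expected_markers if counts[m] > 1)
    return {
        "completeness": present / n_expected,
        "contamination": extra_copies / n_expected,
    }

# Hypothetical marker set and detections: one marker missing, one duplicated.
expected_markers = {"m1", "m2", "m3", "m4"}
markers_found = ["m1", "m2", "m2", "m3"]

quality = marker_based_quality(markers_found, expected_markers)
print(quality)
```

A MAG would then be kept or discarded by comparing these estimates with thresholds chosen for the study at hand.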
3 Metagenome-Assembled Genomes Revealed Extensive
within Community Intraspecies Diversity in a Microbial 
Community 


Microbial populations are often composed of multiple strains of each species, and the
resulting intraspecies diversity could have significant functional and clinical impli- 
cations (Kraal et al. 2014; Greenblum et al. 2015; Oh et al. 2014). Gel microdroplet 
cultivation afforded nearly finished single genomes and revealed substantial intra- 
species diversity within human oral and fecal microbiomes (Fitzsimons et al. 2013). 
Strains of dominant human skin bacterial species were shown to be heterogeneous 
and multiphyletic, which the authors suggested to be the result of micro-scale 
differences in the environment that shaped the ecology and evolution of each 
subpopulation (Oh et al. 2014). Another study reported extensive strain-level vari- 
ation detected in the human gut microbiome using large-scale intraspecies copy 
number variation (Greenblum et al. 2015). This intraspecies variation is thought to 
be associated with obesity and inflammatory bowel disease. These studies highlight 
the complex relationships between within-species diversity and functional capacity, 
linking compositional shifts to subspecies-level variations. 

Intraspecies diversity complicates MAG generation, a problem that is compounded
by the use of short-read sequencing technology. With short reads, it is difficult to
establish linkage and synteny between genotypes in a species genome. Binning
strategies can separate sequences that belong to different species, but are generally
not capable of distinguishing between strains of the same species in a metagenomic
dataset (Huson et al. 2011). There have been encouraging recent developments in
algorithms that address strain-level resolution from metagenomic short-read
sequencing, such as StrainPhlAn (Truong et al. 2017), ConStrains (Luo et al. 2015),
MetaSNV (Costea et al. 2017), and DESMAN (Quince et al. 2017). However, the word
“strain” has been used interchangeably with subspecies type, genotype, and biotype,
among others, in metagenome-derived strain-level resolution analyses. Although
some intraspecies diversity can be purged during assembly, the remainder often
leads to species bins that contain composite genetic information from multiple
genotypes (strains) of the species. Advancements in chromosome conformation
capture (Hi-C) and long-read sequencing technologies such as PacBio SMRT
sequencing and Oxford Nanopore technologies could improve strain deconvolution
from metagenomic data by extending the read length and improving assembly quality
(Frank et al. 2016; Tsai et al. 2016; Belton et al. 2012). However, these technologies
have not yet been widely adopted, probably due to technical limitations.


4 A Practical Definition of Meta-Pangenome


The pangenome has been an important concept and tool in comparative genomics
for dissecting microbial diversity. A pangenome generally refers to the entire
collection of genetic content from all strains of a species (Tettelin et al. 2005; Medini
et al. 2005; Vernikos et al. 2015). By definition, a pangenome represents the full
genetic potential of a species and is typically determined by homology among sets
of genes belonging to multiple strains of the species across all environments in
which the species is found. Here, we extend the pangenome concept to incorporate
metagenome-derived genes and genomes. It is a natural extension, as MAGs and
metagenomic contigs have been used to generate gene catalogs both for individual
species and for all species present in a given environment (Ma et al. 2019). We
introduce the term meta-pangenome to refer to the union of genes of a species found
in a habitat using both culture-independent sequencing (metagenome) and
culture-based sequencing (genome) methods. In computational terms, the
meta-pangenome is the entire sequence space of a species in an environment. Thus,
within a sample, a metagenomic species represents known combinations of strains of
a species. In this chapter, we choose to discuss the meta-pangenome in the context of
a species, although the meta-pangenome paradigm can be applied to genera or broader
taxonomic groups (Lefebure and Stanhope 2007) as well as to other domains of life
such as fungi (McCarthy and Fitzpatrick 2019). The term “pan” itself means “whole”
or “everything”, and “meta” as a prefix can mean “with”, “among”, or “beyond”.
Together, the words “meta-pangenome” literally mean whole genomes of a species
from among samples collected in a given environment.

Similar to the pangenome concept, a meta-pangenome is bound to a specific
species. To define the meta-pangenome for a species, say species A, we start
by collecting all available genomes and constructing MAGs of species A from
metagenomes (illustrated in Fig. 1). We then perform gene calling on the MAG
contigs after quality assessment, followed by a similarity search to generate
homologous gene clusters as in conventional pangenome analyses. The final step is
to perform meta-pangenome size interpolation and extrapolation for species A. This
procedure can then be repeated for each of the species present in a particular
environment to define their meta-pangenomes. Alternatively, the genetic contents
characterized in all metagenomes and genomes of a habitat can be collectively
pooled to generate homologous gene clusters. Taxonomic assignment of the
resulting gene clusters can then be used to produce meta-pangenomes for each of
the species present in the habitat.
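
The clustering step of this workflow can be illustrated with a minimal sketch. It
assumes that gene calling and quality assessment have already been done, and it
replaces a real homology search with a toy k-mer Jaccard similarity and a greedy
assignment to cluster representatives. The function names, the similarity measure,
the 0.5 threshold, and the sequences are all hypothetical choices made for
illustration, not the procedure used by the authors.

```python
def kmers(seq, k=3):
    """Return the set of k-mers in a sequence (toy similarity feature)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets; a toy stand-in for an alignment score."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def cluster_genes(sources, threshold=0.5):
    """Greedily assign each gene to the first cluster with a similar representative.

    sources maps a source name (isolate genome or MAG) to its called gene sequences.
    Returns (representatives, presence), where presence maps each source to the
    set of gene-cluster ids found in it. The threshold is an assumed cutoff.
    """
    representatives = []  # one representative sequence per gene cluster
    presence = {}
    for name, genes in sources.items():
        presence[name] = set()
        for gene in genes:
            for cid, rep in enumerate(representatives):
                if similarity(gene, rep) >= threshold:
                    presence[name].add(cid)
                    break
            else:
                # No existing cluster is similar enough: open a new one.
                representatives.append(gene)
                presence[name].add(len(representatives) - 1)
    return representatives, presence


# Hypothetical genes of "species A" from one isolate genome and two MAGs.
sources = {
    "isolate_1": ["ATGGCTAAAGGT", "ATGCCCTTTGAA"],
    "MAG_sample_1": ["ATGGCTAAAGGA", "ATGTTTCCCGGG"],
    "MAG_sample_2": ["ATGCCCTTTGAA", "ATGTTTCCCGGG", "ATGAAACGTACG"],
}
reps, presence = cluster_genes(sources)

# The meta-pangenome is the union of gene clusters over genomes and MAGs.
meta_pangenome = set().union(*presence.values())
```

In this toy input the two MAGs contribute gene clusters that the isolate genome
lacks, which is the situation the meta-pangenome is meant to capture. The
per-source presence sets are the kind of input that size interpolation and
extrapolation would then operate on.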

We can then apply the concepts of core, accessory, and unique genes to the
meta-pangenome framework. The meta-pangenome core genes of a species are those
consistently present in all or almost all metagenomes in a habitat such as
wastewater or the GI tract, and meta-pangenome-specific genes are those observed in
only a single sample of the habitat. The variable or accessory meta-pangenome
includes those genes present in only a subset of populations. As a metagenome can be
considered a snapshot of the microbial community genetic potential at the time of
collection, the core meta-pangenome can be regarded as the set of genes repeatedly
observed across multiple sampling events. A closed meta-pangenome would thus refer
to the case where no or very few new genes of the species are added with each
additional metagenome sequenced. Conversely, an open meta-pangenome would refer to
the case where a substantive number of new genes for that species are discovered
with each additional metagenome sequenced. The core meta-pangenome for a
species could be quite small, or even nonexistent, if the abiotic and biotic constraints
on its colonization of the environment are loose, or large if these constraints are
strict.

Similar to the ecological significance of the original pangenome (Tettelin et al.
2005), population size and niche versatility are likely to drive the size of a
meta-pangenome. For example, the meta-pangenome of Gardnerella vaginalis, a highly
prevalent bacterial colonizer of the human vagina, is the collection of all the genes
assigned to that species derived from all available vaginal metagenomes and genomes.
Despite the availability of hundreds of metagenomes containing G. vaginalis, this
important species shows an open meta-pangenome (Fig. 2). On the other hand,
Lactobacillus gasseri, another important and beneficial vaginal bacterial species,
demonstrates an essentially closed meta-pangenome, such that new metagenome
sequences add relatively few genes. An in-depth understanding of the genetic
diversity of constituent community members and its relation to community dysbiosis
will afford the development of novel strategies to evaluate and optimize prevention,
diagnostics, and treatment for adverse health conditions.

Fig. 1 Illustration of a workflow to generate a meta-pangenome for a species. The steps could be
modified. For example, the gene calling step could follow the step of pooling all
deconvoluted assemblies for a species. Alternatively, the genetic contents
characterized in all metagenomes and genomes of a habitat can be collectively pooled to generate
homologous gene clusters. Taxonomic assignment of the resulting gene clusters can then be used to
produce meta-pangenomes for each of the species present in the habitat

Fig. 2 Species-specific [...] et al. (2019)
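
The core, accessory, and sample-specific categories, and the open versus closed
distinction discussed in this section, can be sketched in a few lines. This is only
an illustration under stated assumptions: the 0.95 cutoff for "all or almost all"
samples, the function names, and the toy presence data are hypothetical, and the
accumulation curve is averaged over random sample orderings rather than fitted
with a regression model for extrapolation.

```python
import random


def classify_clusters(presence, core_fraction=0.95):
    """Split gene clusters into core, accessory, and sample-specific sets.

    presence maps a sample name to the set of gene-cluster ids observed in it.
    core_fraction is an assumed cutoff for "all or almost all" samples.
    """
    n = len(presence)
    counts = {}
    for clusters in presence.values():
        for cid in clusters:
            counts[cid] = counts.get(cid, 0) + 1
    core = {c for c, k in counts.items() if k >= core_fraction * n}
    specific = {c for c, k in counts.items() if k == 1 and c not in core}
    accessory = set(counts) - core - specific
    return core, accessory, specific


def accumulation(presence, order):
    """Cumulative number of distinct clusters as samples are added in order."""
    seen, sizes = set(), []
    for name in order:
        seen |= presence[name]
        sizes.append(len(seen))
    return sizes


def mean_accumulation(presence, permutations=100, seed=0):
    """Average the accumulation curve over random sample orderings."""
    rng = random.Random(seed)
    names = list(presence)
    totals = [0.0] * len(names)
    for _ in range(permutations):
        rng.shuffle(names)
        for i, size in enumerate(accumulation(presence, names)):
            totals[i] += size
    return [t / permutations for t in totals]


# Hypothetical gene-cluster content of one species in four metagenomes.
presence = {
    "s1": {"a", "b", "c"},
    "s2": {"a", "b", "d"},
    "s3": {"a", "c", "e"},
    "s4": {"a", "b", "c", "f"},
}
core, accessory, specific = classify_clusters(presence)
curve = mean_accumulation(presence)

# Average number of new clusters contributed by the last sample added.
new_genes_last_step = curve[-1] - curve[-2]
```

A curve that keeps rising as samples are added points toward an open
meta-pangenome, whereas a curve that flattens points toward a closed one. With
real data the number of samples would be far larger than in this toy input.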


5 A Conceptual Framework for Microbial Comparative
Genomics: Meta-Pangenome, Metagenomic Subspecies, 
and Pan-Metagenome 


The meta-pangenome forms a practical framework that provides unprecedented
insights into the genetic and functional basis underlying the ecological fitness of
microbial populations in an environmental niche. The variable or accessory
meta-pangenome of a species comprises the genes present in a subset, but not all, of
the samples, which has led to the new concept of “metagenomic subspecies” (Ma et al.
2019). In essence, a metagenomic subspecies represents a slice of a species’
meta-pangenome that is commonly identified in metagenomic samplings of a habitat.
This slice contains the genetic contents of a combination of strains that tend to
co-occur. In theory, this co-occurrence could be driven by interactions among the
strains and/or their tendency to co-colonize, termed dispersal limitations (Telford
et al. 2006). Specific mechanisms that can lead to the co-existence of multiple
strains in a population include frequency-dependent selection (Svensson and
Connallon 2019), cross-feeding (Livingston et al. 2012; Hunt and Bonsall 2009),
spatial structure (France and Forney 2019), resource partitioning (Rosenzweig et al.
1994), and interference competition (Kerr et al. 2002), among others. In this sense,
the metagenomic subspecies concept is equivalent to a genetic “ecotype” of a species
for an environment. Several metagenomic subspecies can exist in a given environment
but cannot co-occur within a sample. Metagenomic subspecies can be determined in
silico by hierarchical clustering over a data matrix such as gene prevalence or gene
abundance profiles. Further development of relevant pattern recognition tools
(supervised or unsupervised), as well as the approximation of the population size
(number of strains), are important ongoing research developments that will
contribute to this field.
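
As a rough illustration of the in silico step, the sketch below groups samples by
the gene content of one species using agglomerative clustering on Jaccard
distances. It is a simplified stand-in and not the published procedure: it uses
average linkage (the Fig. 3 caption reports Ward linkage), the number of groups is
chosen by hand, and the function names and sample profiles are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets of gene-cluster ids."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(profiles, n_groups):
    """Agglomerative clustering: repeatedly merge the two closest groups.

    profiles maps a sample name to the set of gene-cluster ids of one species.
    The distance between two groups is the mean pairwise Jaccard distance.
    """
    groups = [[name] for name in profiles]

    def group_distance(g1, g2):
        total = sum(jaccard_distance(profiles[x], profiles[y])
                    for x in g1 for y in g2)
        return total / (len(g1) * len(g2))

    while len(groups) > n_groups:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = group_distance(groups[i], groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]  # merge the closest pair
        del groups[j]
    return [sorted(g) for g in groups]


# Hypothetical presence profiles: s1/s2 share one gene set, s3/s4 another.
profiles = {
    "s1": {1, 2, 3, 4},
    "s2": {1, 2, 3, 5},
    "s3": {1, 6, 7, 8},
    "s4": {1, 6, 7, 9},
}
subspecies = average_linkage(profiles, n_groups=2)
```

Each resulting group of samples is a candidate metagenomic subspecies. Because
every sample falls into exactly one group, the sketch is consistent with the
statement that metagenomic subspecies do not co-occur within a sample.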

The concepts of meta-pangenome and metagenomic subspecies are valuable for
investigating intraspecies diversity within a community and the genetic foundation
underlying the functions, resilience, resistance, or fitness, among others, of
microbial communities. We term the entire collection of all species’
meta-pangenomes that exist in a specific environment the “pan-metagenome,” which is
essentially the “habitome” that encompasses the genetic landscape of a habitat. For
instance, the pan-metagenome of the human gastrointestinal (GI) tract is the
collection of all genes of all species found in the human GI tract (Qin et al. 2010;
Li et al. 2014), and the pan-metagenome of the human oral communities encompasses
the total genetic content of all species in the human oral environment (Tierney et al.
2019). The concept of pan-metagenome is represented by extensive gene catalogs, such
as those constructed for the pig (Xiao et al. 2016) or the mouse GI tract (Xiao et al.
2015). A pan-metagenome of a specific habitat, when used as a catalog of the genetic
contents, provides a comprehensive reference framework for the study of
microbial communities and their interaction with the environment.
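
The relationship between meta-pangenomes and a pan-metagenome used as a reference
catalog can be sketched as follows. This is a minimal illustration that assumes
each gene identifier belongs to exactly one species; the function names and the
gene and species identifiers are hypothetical.

```python
from collections import Counter


def build_catalog(meta_pangenomes):
    """Flatten species -> gene ids into one gene -> species lookup.

    The keys of the returned dict form the pan-metagenome: the union of
    all species' meta-pangenomes in the habitat.
    """
    catalog = {}
    for species, genes in meta_pangenomes.items():
        for gene in genes:
            catalog[gene] = species
    return catalog


def profile_sample(catalog, observed_genes):
    """Count, per species, how many catalog genes were observed in a sample."""
    return Counter(catalog[g] for g in observed_genes if g in catalog)


# Hypothetical meta-pangenomes of two species in one habitat.
meta_pangenomes = {
    "species_A": {"A_g1", "A_g2", "A_g3"},
    "species_B": {"B_g1", "B_g2"},
}
catalog = build_catalog(meta_pangenomes)

# Genes detected in a new sample; genes absent from the catalog are ignored.
profile = profile_sample(catalog, ["A_g1", "A_g3", "B_g2", "unknown_gene"])
```

Used this way, the catalog gives every new sample a common reference against
which its per-species gene content can be compared.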

We have recently constructed a pan-metagenome for the human vaginal tract
named VIRGO (the human vaginal nonredundant gene catalog) using an array of
urogenital bacterial isolate genomes and vaginal metagenomes (Ma et al. 2019).
VIRGO has been shown to be comprehensive and to provide an unbiased
representation of the genetic diversity of each species found in the vaginal
microbiome. In building VIRGO, we found that the vast majority of the genetic
diversity was contributed by MAGs derived from the metagenomic datasets. In fact,
the metagenomic data used to build VIRGO comprise a much larger genetic diversity
(a higher number of nonredundant genes) than that of all isolate genome sequences
combined (Fig. 3a, b). This result indicates the importance of extending
the pangenome concept beyond isolate genome sequences.

VIRGO has afforded a different view of the vaginal microbiome, in which each
population is composed of a complex mixture of multiple strains, highlighting the
large amount of intraspecies diversity present in these communities. We found that,
in general, the majority of a species’ genes are meta-pangenomic accessory genes. For
example, for Lactobacillus crispatus, the number of meta-pangenomic accessory
genes is twice the number of meta-pangenomic core genes (Fig. 3c).
G. vaginalis demonstrated particularly high intraspecies diversity: no core
meta-pangenome exists for it, and the majority of its genes are accessory or
sample specific, suggesting that the species should be split into multiple
species within the genus Gardnerella. We further observed three distinct
metagenomic subspecies of L. gasseri, two of which were distinct types and the
third a combination of the two (Fig. 3d). This suggests that there is
environmentally specialized co-colonization of L. gasseri strains in the vaginal
environment. Future studies are needed to reveal the linkage between specific
metagenomic subspecies and pathophysiological conditions.


6 Concluding Remarks


The field of comparative genomics has bloomed since that initial genome
comparison two decades ago. Thanks to advancements in cultivation-independent whole
community sequencing technology and the increased availability of
metagenome-assembled genomes, we have obtained unprecedented insights into the
incredible amount of diversity present within microbial populations. Intraspecies
diversity exceeds that found in our current reference genome databases. The
pangenome paradigm, expanded to metagenome-assembled genomes and metagenomic
contigs, comprehensively profiles microbial genetic diversity in a specific habitat.
However, the incorporation of metagenome-derived genomes has to be performed
carefully, with stringent quality assessment, to avoid spurious inflation of gene
content. The meta-pangenome concept unites pangenomics and metagenomics to obtain a
more complete and ecologically meaningful view of different ecosystems.
Meta-pangenomes and pan-metagenomes represent a critical step in the development of
a systems-level understanding of microbial ecosystems.

Fig. 3 Intraspecies diversity revealed using VIRGO (human vaginal nonredundant gene catalog) for
seven vaginal species: L. crispatus, L. iners, L. jensenii, L. gasseri, G. vaginalis,
A. vaginae, and P. timonensis. (a) Summary of the number (N) of isolate genomes and metagenome
(MG) samples with more than 80% of their average genome's number of coding genes for a species,
based on a dataset of 1507 in-house vaginal metagenomes characterized using VIRGO. (b) Boxplot
of the number of nonredundant genes in isolate genomes versus vaginal metagenomes. (c) Heatmap of
presence/absence of L. crispatus nonredundant gene profiles for 56 available isolate genomes and
413 VIRGO-characterized metagenomes that contained either high (red) or low (blue) relative
abundance of the species. Hierarchical clustering of the profiles was performed using Ward linkage
based on their Jaccard similarity coefficient. N: number of isolate genomes and metagenome samples.
MG: metagenomes. *p < 0.05, ***p < 0.001 after correction for multiple comparisons.
Figure reproduced from Ma et al. (2019)


Meta-Pangenome: At the Crossroad of Pangenomics and Metagenomics


Acknowledgment The author acknowledges The Gerber Foundation 2018 award. 


Competing Interest Statement The authors declare no competing financial and nonfinancial 
interests. 


References 


Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, Nielsen PH (2013) Genome 
sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple 
metagenomes. Nat Biotechnol 31:533-538 

Amann RI, Ludwig W, Schleifer KH (1995) Phylogenetic identification and in-situ detection of 
individual microbial-cells without cultivation. Microbiol Rev 59:143-169 

Bakken LR (1985) Separation and purification of bacteria from soil. Appl Environ Microbiol 
49:1482-1487 

Belton JM, McCord RP, Gibcus JH, Naumova N, Zhan Y, Dekker J (2012) Hi-C: a comprehensive 
technique to capture the conformation of genomes. Methods 58:268-276 

Boisvert S, Raymond F, Godzaridis E, Laviolette F, Corbeil J (2012) Ray meta: scalable de novo 
metagenome assembly and profiling. Genome Biol 13:R122 

Bowers RM, Clum A, Tice H, Lim J, Singh K, Ciobanu D, Ngan CY, Cheng JF, Tringe SG, Woyke 
T (2015) Impact of library preparation protocols and template quantity on the metagenomic 
reconstruction of a mock microbial community. BMC Genomics 16:856 

Breitwieser FP, Lu J, Salzberg SL (2017) A review of methods and databases for metagenomic 
classification and assembly. Brief Bioinform 20(4):1125-1136 

Costea PI, Munch R, Coelho LP, Paoli L, Sunagawa S, Bork P (2017) metaSNV: a tool for 
metagenomic strain level analysis. PLoS One 12:e0182392 

Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, 
Relman DA (2005) Diversity of the human intestinal microbial flora. Science 308:1635-1638 

Eren AM, Esen OC, Quince C, Vineis JH, Morrison HG, Sogin ML, Delmont TO (2015) Anvi'o: an 
advanced analysis and visualization platform for ’omics data. PeerJ 3:e1319 

Fitzsimons MS, Novotny M, Lo CC, Dichosa AE, Yee-Greenbaum JL, Snook JP, Gu W, 
Chertkov O, Davenport KW, McMurry K et al (2013) Nearly finished genomes produced 
using gel microdroplet culturing reveal substantial intraspecies genomic diversity within the 
human microbiome. Genome Res 23:878-888

Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb 
JF, Dougherty BA, Merrick JM et al (1995) Whole-genome random sequencing and assembly of 
Haemophilus influenzae Rd. Science 269:496-512

France MT, Forney LJ (2019) The relationship between spatial structure and the maintenance of 
diversity in microbial populations. Am Nat 193:503-513

France MT, Mendes-Soares H, Forney LJ (2016) Genomic comparisons of Lactobacillus crispatus
and Lactobacillus iners reveal potential ecological drivers of community composition in the
vagina. Appl Environ Microbiol 82:7063-7073 

Frank JA, Pan Y, Tooming-Klunderud A, Eijsink VGH, McHardy AC, Nederbragt AJ, Pope PB 
(2016) Improved metagenome assemblies and taxonomic binning using long-read circular 
consensus sequence data. Sci Rep 6:25373 

Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage 
AR, Sutton G, Kelley JM et al (1995) The minimal gene complement of Mycoplasma
genitalium. Science 270:397-403

Gilbert JA, Dupont CL (2011) Microbial metagenomics: beyond the genome. Annu Rev Mar Sci 
3:347-371 


216 B. Ma et al. 


Greenblum S, Carr R, Borenstein E (2015) Extensive strain-level copy-number variation across 
human gut microbiome species. Cell 160(4):583-594 

Handelsman J (2004) Metagenomics: application of genomics to uncultured microorganisms. 
Microbiol Mol Biol Rev 68:669-685

Hardison RC (2003) Comparative genomics. PLoS Biol 1:E58 

Howe A, Pell J, Canino-Koning R, Mackelprang R, Tringe S, Jansson J, Tiedje JM, Brown CT 
(2012) Illumina sequencing artifacts revealed by connectivity analysis of metagenomic datasets 

Hunt JJ, Bonsall MB (2009) The effects of colonization, extinction and competition on co-existence 
in metacommunities. J Anim Ecol 78:866-879

Huson DH, Mitra S, Ruscheweyh HJ, Weber N, Schuster SC (2011) Integrative analysis of 
environmental sequences using MEGAN4. Genome Res 21:1552-1560 

Iverson V, Morris RM, Frazar CD, Berthiaume CT, Morales RL, Armbrust EV (2012) Untangling 
genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science 
335:587-590

Kerr B, Riley MA, Feldman MW, Bohannan BJ (2002) Local dispersal promotes biodiversity in a 
real-life game of rock-paper-scissors. Nature 418:171-174

Koren S, Phillippy AM (2015) One chromosome, one contig: complete microbial genomes from 
long-read sequencing and assembly. Curr Opin Microbiol 23:110-120 

Kraal L, Abubucker S, Kota K, Fischbach MA, Mitreva M (2014) The prevalence of species and 
strains in the human microbiome: a resource for experimental efforts. PLoS One 9:e97279 

Lefebure T, Stanhope MJ (2007) Evolution of the core and pan-genome of Streptococcus: positive
selection, recombination, and genome composition. Genome Biol 8:R71 

Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K et al (2010) De novo 
assembly of human genomes with massively parallel short read sequencing. Genome Res 
20:265-272 

Li J, Jia H, Cai X, Zhong H, Feng Q, Sunagawa S, Arumugam M, Kultima JR, Prifti E, Nielsen T 
et al (2014) An integrated catalog of reference genes in the human gut microbiome. Nat 
Biotechnol 32:834-841

Li D, Liu CM, Luo R, Sadakane K, Lam TW (2015) MEGAHIT: an ultra-fast single-node solution 
for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 
31:1674-1676 

Li D, Luo R, Liu CM, Leung CM, Ting HF, Sadakane K, Yamashita H, Lam TW (2016) 
MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies 
and community practices. Methods 102:3-11 

Livingston G, Matias M, Calcagno V, Barbera C, Combe M, Leibold MA, Mouquet N (2012) 
Competition-colonization dynamics in experimental bacterial metacommunities. Nat Commun 
3:1234 

Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q et al (2012) SOAPdenovo2: an 
empirically improved memory-efficient short-read de novo assembler. GigaScience 1(1):18 

Luo C, Knight R, Siljander H, Knip M, Xavier RJ, Gevers D (2015) ConStrains identifies microbial 
strains in metagenomic datasets. Nat Biotechnol 33:1045-1052 

Ma B, France M, Crabtree J, Holm J, Humphrys M, Brotman R, Ravel J (2019) VIRGO, a
comprehensive non-redundant gene catalog, reveals extensive within community intraspecies
diversity in the human vagina. bioRxiv

Mackelprang R, Waldrop MP, DeAngelis KM, David MM, Chavarria KL, Blazewicz SJ, Rubin
EM, Jansson JK (2011) Metagenomic analysis of a permafrost microbial community reveals a
rapid response to thaw. Nature 480:368-371

McCarthy CGP, Fitzpatrick DA (2019) Pan-genome analyses of model fungal species. Microb
Genom 5:e000243

Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R (2005) The microbial pan-genome. Curr
Opin Genet Dev 15:589-594

Meyer F, Hofmann P, Belmann P, Garrido-Oter R, Fritz A, Sczyrba A, McHardy AC (2018) 
AMBER: assessment of Metagenome BinnERs. Gigascience 7 




Miller W, Makova KD, Nekrutenko A, Hardison RC (2004) Comparative genomics. Annu Rev
Genomics Hum Genet 5:15-56

Namiki T, Hachiya T, Tanaka H, Sakakibara Y (2012) MetaVelvet: an extension of Velvet
assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res
40:e155

Nurk S, Meleshko D, Korobeynikov A, Pevzner PA (2017) metaSPAdes: a new versatile 
metagenomic assembler. Genome Res 27:824-834 

Oh J, Byrd AL, Deming C, Conlan S, Program NCS, Kong HH, Segre JA (2014) Biogeography and 
individuality shape function in the human skin metagenome. Nature 514:59-64 

Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW (2015) CheckM: assessing the
quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome 
Res 25:1043-1055 

Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT (2012) Scaling metagenome 
sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci U S A 
109:13272-13277 

Peng Y, Leung HC, Yiu SM, Chin FY (2012) IDBA-UD: a de novo assembler for single-cell and 
metagenomic sequencing data with highly uneven depth. Bioinformatics 28:1420-1428 

Pereira-Marques J, Hout A, Ferreira RM, Weber M, Pinto-Ribeiro I, van Doorn LJ, Knetsch CW, 
Figueiredo C (2019) Impact of host DNA and sequencing depth on the taxonomic resolution of 
whole metagenome sequencing for microbiome analysis. Front Microbiol 10:1277

Pevzner PA, Tang H, Waterman MS (2001) An Eulerian path approach to DNA fragment assembly. 
Proc Natl Acad Sci U S A 98:9748-9753

Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, 
Yamada T et al (2010) A human gut microbial gene catalogue established by metagenomic 
sequencing. Nature 464:59-65

Quince C, Delmont TO, Raguideau S, Alneberg J, Darling AE, Collins G, Eren AM (2017) 
DESMAN: a new tool for de novo extraction of strains from metagenomes. Genome Biol 
18:181 

Rosenzweig RF, Sharp RR, Treves DS, Adams J (1994) Microbial evolution in a simple unstructured
environment: genetic differentiation in Escherichia coli. Genetics 137:903-917

Sangwan N, Xia F, Gilbert JA (2016) Recovering complete and draft population genomes from 
metagenome datasets. Microbiome 4:8 

Shaiber A, Eren AM (2019) Composite metagenome-assembled genomes reduce the quality of 
public genome repositories. MBio 10(3):e00725-e00719

Sieber CMK, Probst AJ, Sharrar A, Thomas BC, Hess M, Tringe SG, Banfield JF (2018) Recovery 
of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat 
Microbiol 3:836-843

Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I (2009) ABySS: a parallel 
assembler for short read sequence data. Genome Res 19:1117-1123 

Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, Neal PR, Arrieta JM, Herndl GJ 
(2006) Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl 
Acad Sci U S A 103:12115-12120 

Stewart RD, Auffret MD, Snelling TJ, Roehe R, Watson M (2019) MAGpy: a reproducible pipeline 
for the downstream analysis of metagenome-assembled genomes (MAGs). Bioinformatics 
35:2150-2152 

Svensson EI, Connallon T (2019) How frequency-dependent selection affects population fitness, 
maladaptation and evolutionary rescue. Evol Appl 12:1243-1258 

Telford RJ, Vandvik V, Birks HJ (2006) Dispersal limitations matter for microbial morphospecies. 
Science 312:1015 

Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J,
Jones AL, Durkin AS et al (2005) Genome analysis of multiple pathogenic isolates of Streptococcus
agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci U S A
102:13950-13955




Tierney BT, Yang Z, Luber JM, Beaudin M, Wibowo MC, Baek C, Mehlenbacher E, Patel CJ, 
Kostic AD (2019) The landscape of genetic content in the gut and oral human microbiome. Cell
Host Microbe 26:283-295.e288

Touchman J (2010) Comparative genomics. Nat Educ Knowl 3:13 

Truong DT, Tett A, Pasolli E, Huttenhower C, Segata N (2017) Microbial strain-level population 
structure and genetic diversity from metagenomes. Genome Res 27:626-638 

Tsai YC, Conlan S, Deming C, Program NCS, Segre JA, Kong HH, Korlach J, Oh J (2016) 
Resolving the complexity of human skin metagenomes using single-molecule sequencing. 
MBio 7:e01948-e01915 

Vernikos G, Medini D, Riley DR, Tettelin H (2015) Ten years of pan-genome analyses. Curr Opin 
Microbiol 23:148-154 

Xia X (2013) Comparative genomics. In Briefs in Genetics. Springer, Heidelberg 

Xiao L, Feng Q, Liang S, Sonne SB, Xia Z, Qiu X, Li X, Long H, Zhang J, Zhang D et al (2015) A 
catalog of the mouse gut metagenome. Nat Biotechnol 33:1103-1108 

Xiao L, Estelle J, Kiilerich P, Ramayo-Caldas Y, Xia Z, Feng Q, Liang S, Pedersen AO, Kjeldsen 
NJ, Liu C et al (2016) A reference gene catalogue of the pig gut microbiome. Nat Microbiol 
1:16161 

Zaheer R, Noyes N, Ortega Polo R, Cook SR, Marinier E, Van Domselaar G, Belk KE, Morley PS, 
McAllister TA (2018) Impact of sequencing depth on the characterization of the microbiome 
and resistome. Sci Rep 8:5890 


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons licence and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter's Creative 
Commons licence, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter's Creative Commons licence and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Pangenome Flux Balance Analysis Toward Panphenomes

Abstract Studies of the pangenome have been empowered by an exponentially
increasing amount of strain-specific genome sequencing data. With this data deluge 
comes a need for new tools to contextualize, analyze, and interpret such a vast 
amount of information. Network reconstructions, genome-scale metabolic models 
(GEMs), and the corresponding computational analysis frameworks such as flux 
balance analysis (FBA) have proven useful toward this end. Network reconstructions
can be used to interpret genomic variation not just from a single strain but
for an entire species. By applying these approaches at the pangenome scale, it 
becomes possible to systematically evaluate phenotypic properties for an entire 
species thus enabling the study of a panphenome directly from a pangenome. 
Applying insights gained from analysis of the panphenome has diverse implications 
with applications ranging from human health to metabolic engineering. The future of 
pangenomics will include panphenomic analyses, thus supplementing traditional 
pangenomic analyses and helping to address the Big-data-to-knowledge grand 
challenge of analyzing thousands of genomic sequences. 


Keywords Flux balance analysis · Genome-scale modeling · Panphenome · Multi-strain ·
Comparative systems biology


C. J. Norsigian · X. Fang · B. O. Palsson · J. M. Monk
Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
e-mail: jmonk@ucsd.edu

© The Author(s) 2020
H. Tettelin, D. Medini (eds.), The Pangenome,
https://doi.org/10.1007/978-3-030-38281-0_10


1 Introduction


Studying differences between strains of a species using the construct of a
pangenome revolutionized the field of comparative genomics for bacteria (Tettelin
et al. 2005; Medini et al. 2005). This framework allowed scientists to overcome
issues related to species with high genomic variability and lack of a reference
genome. The pangenome alone cannot be used to quantify the phenotypic effects
of genetic variability. Over the past decade, network reconstructions have become an
indispensable tool in molecular systems biology because of their ability to provide a 
mechanistic link between experimental studies and computational analyses (Bordbar 
et al. 2014). Thus, genome-scale network reconstructions provide an avenue for 
extending the power of the pangenome toward evaluating the phenotypic capabilities 
of a species or the panphenome. High-quality reconstructions can be expanded 
through bioinformatic techniques to map information from a reference strain to 
additional strains of the target organism. This chapter describes how reconstructions 
and genome-scale models have been applied to study the pangenome by predicting 
all possible phenotypes for strains in a species. Using these tools, large-scale 
genomic data sets combined with experimental phenotypes can now be integrated 
and queried to systematically probe the diversity of strains within a species. 
Genome-scale metabolic network reconstructions can delineate conserved and 
unique metabolic capabilities across the strains of a species. These differences and 
designations can be used to define the metabolic potential of a species, which is
often informative of lifestyle diversity. In this chapter, we detail the following elements
toward true panphenomic analysis: (1) The foundation of reconstructions and flux 
balance analysis; (2) The extension of these tools using a “multi-strain” approach to 
calculate metabolic panphenomes for several bacterial species; and (3) A future 
perspective on the multi-strain approach: moving beyond metabolism for a full 
calculation of the panphenome. 


2 Network Reconstructions and Flux Balance Analysis 


The growing collections of sequences that have been used to study pangenomes are 
laden with valuable information; however, strings of nucleotide bases alone do not
make this information easily accessible or immediately apparent. Thus, there is a 
critical need for tools that can be used to interrogate this massive amount of data to 
generate new knowledge. Genome-scale network reconstructions in concert with 
flux balance analysis (FBA) provide such a tool. This section describes the process 
of reconstruction as well as mathematical approaches that can be used to query and 
compute with reconstructions, in particular FBA.


2.1 Network Reconstructions Structure Biological Knowledge 


Genome-scale reconstructions are organism-specific knowledge bases. They are 
built systematically using a quality-controlled bottom-up workflow that incorporates 
genome annotation, omics data sets, and legacy knowledge. The literature detailing 
the construction and analysis of network reconstructions is extensive (O'Brien et al. 
2015; Thiele and Palsson 2010; Herrgard et al. 2008). In brief, these tools organize 
knowledge by linking genes, gene products, and cellular components (Fig. 1a). 

Fig. 1 (a) Reconstructions consist of layered information connecting annotated genes on the
genome sequence to their encoded biological products (e.g., RNA, protein) and how those
components interact with other biological components (e.g., protein-metabolite, in the case of a
metabolic reaction/transformation). Figure reprinted from Reed et al. (2006). (b) Genome-scale
models exist for species across the tree of life; they are being made for new species and are
constantly improving. Reprinted from Monk et al. (2014). (c) Reconstructions can be converted to a
mathematical format by accounting for the use of biological components (e.g.,
consumption/production). This allows for molecular accounting and enforcement of constraints.
(d) Enforcement of constraints (e.g., media updates) and applying an objective (e.g., production
of biomass, i.e., growth) allows for simulation of biological phenotypes from the genotype.
Panels c and d reproduced from O'Brien et al. (2015)


Reconstructions can be made for several cellular processes including transcriptional 
regulation (Gianchandani et al. 2006, 2009), expression (Thiele et al. 2009) and 
metabolism (Feist et al. 2009). The reconstruction approach is iterative, so all
reconstructions continually improve as new knowledge is generated. Reconstructions
therefore serve as a valuable resource to integrate and reconcile biochemical
data, allowing researchers to collaborate, test, and readily share new hypotheses
about functions in a target organism (Monk et al. 2014).

Reconstructions of cellular metabolism have been the most developed and exten- 
sively used type thus far (Bordbar et al. 2014). Metabolic network reconstructions 
are composed of all known metabolic genes, their encoded proteins and catalyzed 
reactions. This information is synthesized by aggregating organism-specific data- 
bases, high-throughput data, and primary literature (Thiele and Palsson 2010). 
Advancements have allowed for partial automation of this process (Henry et al.
2010; Agren et al. 2013). Reactions are organized into pathways, pathways into
subsystems, and ultimately into genome-scale networks, thus representing
biological processes at multiple scales. The resulting network reconstruction is a
unification of the information available for an organism with a genetic basis. Today, there
exist collections of genome-scale reconstructions for a number of target organisms 
across the tree of life (Oberhardt et al. 2011; Monk et al. 2014) (Fig. 1b). For 
example, as of 2018, there are 178 available, curated reconstructions spanning the 
tree of life (http://systemsbiology.ucsd.edu/InSilicoOrganisms/OtherOrganisms). 
While this coverage is impressive, several other phyla remain devoid of any recon- 
struction initiative. To fully extend the study of panphenomes to all sequenced 
organisms, new reconstruction efforts must be initiated (Monk et al. 2014). 


2.2 Flux Balance Analysis Enables Computation 
of Phenotype from Genotype 


Reconstructions alone are static and cannot be used for predictions. A major value
of the metabolic reconstructions emerges when they are converted into a mathemat- 
ical format, enabling computational interrogation using a variety of methods (Orth 
et al. 2010; Lewis et al. 2012). This conversion translates the biochemical reactions 
of a reconstructed network via tabulation of reaction stoichiometry into a chemically 
accurate mathematical format that becomes the basis for a genome-scale model 
(GEM) (Fig. 1c). The flow of metabolites through the network is constrained by
these stoichiometries, represented as balances, and by bounds, represented as inequalities (Reed 2012).
Further constraints can be added to a network such as thermodynamic reversibility 
constraints and limitations to nutrient uptake or by-product secretion. Computation- 
ally predicted network states consistent with imposed constraints are potential 
physiological states of the target organism within a defined condition. 
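
The constraints described above can be written compactly. The notation below is the conventional one in the constraint-based modeling literature and is an assumption here, not taken from this chapter: $S$ is the stoichiometric matrix (rows are metabolites, columns are reactions), $v$ is the vector of reaction fluxes, and $v^{\min}$ and $v^{\max}$ are the lower and upper flux bounds.

```latex
\begin{align}
  S\,v &= 0
    && \text{(mass balance for every metabolite)} \\
  v^{\min}_j \le v_j &\le v^{\max}_j
    && \text{(bounds on each reaction } j\text{)}
\end{align}
```

In this form, an irreversible reaction has $v^{\min}_j = 0$, and a nutrient absent from the medium has its uptake bound set to zero. Any $v$ satisfying both lines is a candidate network state in the sense used above.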

Flux balance analysis (FBA) can be applied to these models for prediction of an 
organism's phenotype. This mathematical approach for analyzing the flow of metab- 
olites through a metabolic network is the original constraints-based method (Orth 
et al. 2010). This approach relies on an assumption of steady-state growth and mass
balance. Given a stated objective (for example, biomass production as a proxy for
growth), FBA uses linear programming to find the solution(s) that optimize the
objective function (O'Brien et al. 2015). In a defined environment (defined inputs),
GEMs can be used to compute network outputs (Fig. 1d). FBA allows for computational
tracing of balanced reaction states, beginning with defined inputs, to produce output
metabolites. Biomass synthesis is computed using FBA by computing the balanced
reaction states that produce all the required metabolites for growth simultaneously.
Additionally, the model accounts for the energetic, redox, and chemical balances 
that must also be maintained (O’Brien et al. 2015). 
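
As a minimal sketch of the calculation described above, the following Python example runs FBA on a four-reaction toy network using the general-purpose linear programming routine `scipy.optimize.linprog`. The network, the reaction names (`EX_A`, `CONV`, `BYPROD`, `BIOMASS`), and the bounds are hypothetical and chosen only for illustration. They do not come from any published reconstruction, and genome-scale work would normally rely on a dedicated modeling toolbox rather than on this hand-built matrix.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (illustration only, not from the chapter):
#   EX_A:     -> A     nutrient uptake, capped at 10 flux units
#   CONV:    A -> B    conversion of the nutrient
#   BYPROD:  A ->      by-product secretion
#   BIOMASS: B ->      stands in for the biomass (growth) objective
reactions = ["EX_A", "CONV", "BYPROD", "BIOMASS"]
S = np.array([
    [1.0, -1.0, -1.0,  0.0],   # mass balance for metabolite A
    [0.0,  1.0,  0.0, -1.0],   # mass balance for metabolite B
])
bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]


def run_fba(S, bounds, objective_index):
    """Maximize one flux subject to S v = 0 and the given flux bounds."""
    c = np.zeros(S.shape[1])
    c[objective_index] = -1.0  # linprog minimizes, so negate to maximize
    result = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)
    if not result.success:
        raise RuntimeError(result.message)
    return -result.fun, result.x


biomass_index = reactions.index("BIOMASS")
growth, fluxes = run_fba(S, bounds, biomass_index)
print(f"predicted growth flux: {growth:.2f}")

# A changed environment: with the nutrient removed from the medium,
# the uptake bound is zero and no growth-supporting state exists.
no_nutrient = [(0, 0)] + bounds[1:]
growth_without_A, _ = run_fba(S, no_nutrient, biomass_index)
print(f"predicted growth flux without nutrient A: {growth_without_A:.2f}")
```

The same function call with different bounds is all that is needed to simulate a different medium, which is what makes the approach convenient for screening many conditions.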

Using this technique, a variety of phenotypes, such as the effect of gene knockouts,
metabolite secretion, and growth capabilities on different substrates, can be
predicted rapidly and compared to experimental results to verify their accuracy
(Monk and Palsson 2014). Some of the best models agree with experimental data in
more than 90% of cases (Monk et al. 2017; Brunk et al. 2018). In this
way, GEMs provide a way to bridge the genotype-to-phenotype gap by providing a
robust platform for analyzing the integrated mechanisms of gene products to produce
unique phenotypic states. The utility of a highly curated GEM and the corresponding
computational analyses is increased by the format’s scalability. Through this meth- 
odology, phenotypes for the plethora of sequenced strains within a species become 
readily calculable. In the next section, we will highlight how high-quality recon- 
structions for a single strain can be extrapolated onto several strains of the same 
species to study the phenotypic potential of the pangenome and to gain insight into 
strain-specific metabolic capabilities. 


3 The Multi-Strain Approach: Extending Genome-Scale 
Models to Robustly Explore the Pangenome Phenotypic 
Space 


Once a high-quality reconstruction and genome-scale model exist, their contents (e.g.,
genes, metabolites, and reactions) can be mapped onto other, closely related strains 
in a species. Following this multi-strain approach, tools from comparative genomics 
(Monk and Bosi 2018) can be integrated with genome-scale modeling to identify 
genetic determinants underlying variability of phenotypes. Such a task is crucial to 
understand the evolutionary trajectories of a bacterial species. Strain-specific meta- 
bolic diversity has been illuminated through the use of genome-scale metabolic 
models. Prediction of unique metabolic capabilities and auxotrophies can be used 
to study species lifestyle diversity. This approach is scalable to the pangenome level 
and in turn enables panphenome analysis, thus empowering species-wide compara- 
tive systems biology. This multi-strain approach has been applied to several species 
in a variety of studies and we provide a brief overview of the key insights here. 
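
The mapping step described above can be sketched in a few lines of Python. The example assumes a simplified gene-to-reaction association in which each reaction lists alternative gene sets: any one set suffices, and all genes within a set are required. The reaction identifiers, the ortholog calls, and the strain names are hypothetical. A real workflow would also need to add strain-specific content that is absent from the reference, which this sketch does not attempt.

```python
# Hypothetical reference model: reaction -> alternative gene sets.
# Each inner set is treated as an enzyme complex (all genes needed);
# several sets for one reaction are treated as isozymes (any one suffices).
REFERENCE_GPR = {
    "LACZ": [{"lacZ"}],
    "PFK": [{"pfkA"}, {"pfkB"}],
    "SUCDH": [{"sdhA", "sdhB"}],
    "H2Ot": [],  # no gene association, so it is kept in every strain
}

# Hypothetical ortholog calls: reference genes detected in each strain.
ORTHOLOGS = {
    "strain_1": {"lacZ", "pfkA", "pfkB", "sdhA", "sdhB"},
    "strain_2": {"pfkB", "sdhA"},
}


def strain_reactions(gpr, genes_present):
    """Return the reference reactions a strain can still catalyze."""
    kept = set()
    for reaction, alternatives in gpr.items():
        has_enzyme = any(genes <= genes_present for genes in alternatives)
        if not alternatives or has_enzyme:
            kept.add(reaction)
    return kept


strain_models = {
    strain: strain_reactions(REFERENCE_GPR, genes)
    for strain, genes in ORTHOLOGS.items()
}
for strain, kept in strain_models.items():
    print(strain, sorted(kept))
```

Each resulting reaction set defines a strain-specific draft model that can then be simulated under the same conditions as the reference.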


3.1 Genesis of the Multi-Strain Approach: Studying 
Escherichia coli 


The first instance of the multi-strain approach as described here was executed by
Monk et al., who leveraged a curated genome-scale model of E. coli
K-12 MG1655 that has been continually updated over 15 years to construct
genome-scale models of 55 other fully sequenced E. coli strains (Monk et al. 2013). Using
FBA on all 55 of these models, the authors were able to extensively investigate the
predicted metabolic capabilities of all the strains (Fig. 2a). The authors delineated
strain-specific auxotrophies and substrate preferences among the set of strains. It is
important to note that these predictions and insights were gained from sequence
alone. Further, this study demonstrated the possibility of applying this approach to
understand cases of patho-adaptation to a given environment and evaluate a given
strain's infectious niche.

Fig. 2 (a) Genome-scale models can be used to predict growth capabilities in different
environments and nutritional niches. This figure represents growth predictions for 55 different
strain-specific models of E. coli and Shigella on over 300 different carbon, nitrogen, phosphorus and
sulfur sources. Strains, for the most part, clustered according to their isolated niches (e.g.,
extraintestinal versus intestinal). Reproduced from Monk et al. (2013). (b) Using these growth
predictions allows for the classification of strains and their potential isolation site (e.g., bladder
versus intestine). Decision trees could reliably separate ExPEC from InPEC strains. Left panel
reproduced from Croxen and Finlay (2010). Right panel reproduced from Monk et al. (2013)

Further work scaled up the effort to include 1200 strains of E. coli and demonstrated
a large amount of variability within the species both in gene content and
consequent variability of gene products (Monk et al. 2017). It also utilized the
differences across the 1200 strains to construct a robust classification tree for
discriminating between extra-intestinal and intra-intestinal pathogens using predicted
metabolic phenotypes (Fig. 2b). This type of classification schema opens the door to
investigating how strain-specific traits impact the microbiome. An in-depth example 
of such analyses came in a study by Fang et al. into the metabolic capabilities of 
inflammatory bowel disease (IBD)-associated E. coli strains in the B2 clade (Fang 
et al. 2018). The authors found these strains have advantages in catabolizing sugars 
derived from mucus glycans. The interesting and novel outcomes of these E. coli 
studies clearly demonstrated the value of the approach, and the natural next step was 
to apply the methodology to other species. 
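
To illustrate how predicted growth phenotypes can feed a classification tree, the sketch below finds the single yes/no growth question that best separates two groups of strains, which corresponds to one level of a decision tree. The strain names, substrates, and growth calls are hypothetical. The labels ExPEC and InPEC follow the terminology of Fig. 2, and the actual tree-building procedure used in the cited studies is not reproduced here.

```python
# Hypothetical predicted growth calls (True = model predicts growth)
# and hypothetical class labels for four strains.
PHENOTYPES = {
    "s1": {"substrate_A": True, "substrate_B": True, "substrate_C": False},
    "s2": {"substrate_A": True, "substrate_B": False, "substrate_C": False},
    "s3": {"substrate_A": False, "substrate_B": True, "substrate_C": True},
    "s4": {"substrate_A": False, "substrate_B": False, "substrate_C": False},
}
LABELS = {"s1": "ExPEC", "s2": "ExPEC", "s3": "InPEC", "s4": "InPEC"}


def best_split(phenotypes, labels):
    """Return (substrate, accuracy) for the most discriminating growth call.

    Each side of a split is assigned its majority label, and accuracy is the
    fraction of strains that the resulting one-question rule classifies
    correctly. Ties are resolved in favor of the alphabetically first substrate.
    """
    substrates = sorted(next(iter(phenotypes.values())))
    best = None
    for substrate in substrates:
        correct = 0
        for outcome in (True, False):
            side = [
                labels[strain]
                for strain, calls in phenotypes.items()
                if calls[substrate] == outcome
            ]
            if side:
                correct += max(side.count(label) for label in set(side))
        accuracy = correct / len(phenotypes)
        if best is None or accuracy > best[1]:
            best = (substrate, accuracy)
    return best


split = best_split(PHENOTYPES, LABELS)
print(split)
```

A full decision tree repeats this search recursively within each side of the split, and with real data it would be evaluated on strains held out from training.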


3.2 Expanding the Reach of Multi-Strain Approach Across 
the Phylogenetic Tree 


Numerous studies focusing on various organisms followed the first E. coli studies.
Fouts et al. applied the multi-strain approach, broadened to examine various
species of Leptospira known to have varying levels of pathogenicity (Fouts et al.
2016). They demonstrated that the ability to synthesize vitamin B12 is limited to
pathogenic species of Leptospira and may give them a survival advantage in a
human host where B12 is sequestered by the body. This valuable distinguishing
metabolic capability was captured by leveraging the base reconstruction
across multiple species in the genus.

In 2016, Bosi et al. applied the workflow to 64 strains of Staphylococcus aureus. 
Beyond reconstructing metabolic capabilities, the approach was extended to identify 
virulence factors in the set of 64 strains (Bosi et al. 2016). By using a combination of 
predicted metabolic capabilities linked to virulence factors, they were able to stratify 
the strains by host type. This study added a further layer to the promise of the
multi-strain approach by showing that metabolic capabilities could be analyzed in 
concert with other components of the pangenome, namely virulence factors (toxins, 
adhesins, etc.), and that this combination held predictive power about a strain's host. 
This study also included explicit calculation of the core- and pangenome content of 
S. aureus, a metric of genomic diversity among strains in a species. 
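
The core- and pangenome content mentioned above reduces to set operations once gene families have been assigned to strains. The following sketch assumes that assignment has already been made. The strain names and gene-family labels are hypothetical.

```python
# Hypothetical gene-family content of three strains.
GENE_CONTENT = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD"},
    "strain_3": {"geneA", "geneC", "geneE"},
}


def core_and_pan(gene_content):
    """Return (core, pan): families in every strain, and in at least one."""
    gene_sets = list(gene_content.values())
    if not gene_sets:
        raise ValueError("at least one strain is required")
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return core, pan


core, pan = core_and_pan(GENE_CONTENT)
accessory = pan - core  # families present in some strains but not all
print(f"core: {len(core)}, pan: {len(pan)}, accessory: {len(accessory)}")
```

Tracking how the sizes of `core` and `pan` change as strains are added one at a time gives the familiar core- and pangenome curves.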

The multi-strain approach has also been applied to other pathogens such as 
Acinetobacter baumannii and Salmonella. In a study by Norsigian et al., a highly 
curated base GEM was used to create models for 75 different A. baumannii strains 
(Norsigian et al. 2018). These strain-specific models demonstrated major differences
in metabolism between strains, indicating that a classification scheme may be
possible from sequence alone. Seif et al. built strain-specific models for 450 Salmonella
strains from various serovars to show that metabolic capabilities can be used to
distinguish these serovars (Seif et al. 2018). This study indicates that the host range
may be limited by the metabolic capabilities of different strains.


3.3 Extending the Multi-Strain Approach to Investigate 
Additional Biological Qualities 


The multi-strain framework provides an inherently efficient means of interrogating 
the properties of many strains and a few studies have utilized this organizational 
efficiency to gain insight into properties outside of direct metabolic capabilities. For
example, Choudhary et al. analyzed the agr type of 400 S. aureus strains to examine
the structure of genes within the genome (Choudhary et al. 2018). The authors found
that genomic virulence factor profiles are highly correlated with agr type. They also
identified that divergence in the histidine kinase protein confers signal specificity, with
clear differences in protein structural properties based on agr type. Another example
of additional properties is the investigation of reactive oxygen species (ROS)
tolerance. By leveraging the multi-strain approach in conjunction with 3D structures,
Mih et al. were able to simulate ROS production levels to demonstrate that
antioxidant properties are exhibited in the structural proteome (Mih et al. 2018). A third
example is the study by Kavvas et al., who took a deeper level of resolution
within the genome by looking at the unique alleles present within Mycobacterium
tuberculosis genomes (Kavvas et al. 2018). Through machine learning techniques on
the pangenome, they were able to identify certain alleles potentially responsible for
antimicrobial resistance. The results hint at metabolic rewiring at the allelic level
required for adaptation to antibiotic resistance. The success of the multi-strain 
approach in all these various studies suggests that explicit calculation of the 
panphenome will provide novel insights. 


4 Future Perspectives: Moving Beyond Metabolism: A 
Multi-Scale Approach to Calculating Full Panphenomes 


This chapter details a computational approach (network reconstruction and FBA) to 
systematically calculate metabolic phenotypes for multiple strains in a species. 
Beyond calculation of metabolic phenotypes, new methods, both experimental and 
computational, offer exciting new avenues for research into the pangenome. These 
approaches can be applied at multiple different scales. At the lowest level, single
nucleotide variants (SNVs) can be compared across strains using sequence mapping
toolkits like breseq and GATK (Deatherage and Barrick 2014; McKenna et al. 2010).
These approaches can be scaled up from single base changes to full gene sequences to
compare orthologous ORFs across genomes by comparing sequence-specific alleles
across strains in a species (Fig. 3a). As described here, the presence/absence of given
enzyme-encoding metabolic genes can be used to build strain-specific metabolic
reconstructions that compute metabolic panphenomes. While most of the applications
described here are applied to pathogens with relevance to human health, it is
important to note that the pangenome can also be studied for use in metabolic 
engineering applications. For example, the pangenome can be mined to search for 
enzymes of interest to industrial microbiology (Moscatello and Pfeifer 2018). 
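
The allele-level comparison of orthologous ORFs described above can be sketched as follows. The example assumes that the sequences of one ORF have already been extracted and aligned to equal length. The sequences and strain names are hypothetical. This is a plain column-wise comparison for illustration and is not how breseq or GATK operate, since both call variants from sequencing reads.

```python
from collections import Counter

# Hypothetical, pre-aligned sequences of one orthologous ORF in four strains.
ALLELES = {
    "strain_1": "ATGGCTAAA",
    "strain_2": "ATGGCTAAA",
    "strain_3": "ATGGCGAAA",
    "strain_4": "ATGTCGAAA",
}


def allele_counts(sequences_by_strain):
    """Count how many strains carry each distinct allele of the ORF."""
    return Counter(sequences_by_strain.values())


def variable_positions(sequences_by_strain):
    """Return 0-based alignment columns where strains differ."""
    sequences = list(sequences_by_strain.values())
    if len({len(sequence) for sequence in sequences}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    return [
        index
        for index, column in enumerate(zip(*sequences))
        if len(set(column)) > 1
    ]


counts = allele_counts(ALLELES)
positions = variable_positions(ALLELES)
print(f"{len(counts)} distinct alleles, variable columns: {positions}")
```

Repeating this for every gene in the core genome yields a strain-by-allele table, which is the kind of input that allele-level association analyses require.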

In the future, processes beyond metabolism will also be reconstructed allowing 
for true panphenome calculations. For example, reconstructions of protein expres- 
sion mechanisms already exist (Thiele et al. 2009) and have been integrated with 
models of metabolism (ME models) (O’Brien et al. 2013). These models account for 
the transcription and translation processes and molecular constituents required to 
express enzymes catalyzing metabolic reactions in the metabolic network. It is 
further possible to use the ME model framework to reconstruct proteostatic mech- 
anisms and investigate the structural integrity of the proteome (Chen et al. 2017). In 
the future, multiple ME models of strains in a species will further expand the scope 
of computation possible on contents of the pangenome. 

Beyond metabolism and expression, regulatory networks are another aspect of the 
pangenome that differs between strains, and they have been reconstructed for individual
strains (Gianchandani et al. 2006, 2009). Understanding how certain strains regulate 
the same set of genes (core-genome), as well as diverse sets of genes, will further 
expand our understanding of the structure and function of the pangenome. A small- 
scale study of seven E. coli strains and their RNA-seq expression profiles in aerobic 
and anaerobic environments showed remarkably different expression levels even for 
shared genes of the core-genome (Monk et al. 2016) (Fig. 3b). Studying differen- 
tially expressed genes and the transcription factors known to regulate them may lead 
to the discovery of alternative regulatory strategies between strains of a species. 
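
One simple way to compare transcriptional profiles of shared genes between strains is a pairwise correlation, sketched below. The expression values, the number of genes, and the strain labels are hypothetical and are not data from the cited study. Real RNA-seq comparisons would normally start from normalized, log-transformed counts over the full core genome.

```python
from math import sqrt

# Hypothetical expression levels for the same five core genes in three strains.
EXPRESSION = {
    "strain_a": [10.0, 50.0, 5.0, 80.0, 20.0],
    "strain_b": [12.0, 48.0, 6.0, 75.0, 22.0],
    "strain_c": [40.0, 10.0, 60.0, 15.0, 5.0],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


r_ab = pearson(EXPRESSION["strain_a"], EXPRESSION["strain_b"])
r_ac = pearson(EXPRESSION["strain_a"], EXPRESSION["strain_c"])
print(f"a vs b: {r_ab:.3f}, a vs c: {r_ac:.3f}")
```

Strain pairs with low correlation over core genes are natural starting points when looking for differences in regulation.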

Just as sequence databases have grown tremendously in recent years, the number of 3D crystal
structures for the encoded proteins has also grown dramatically (Brunk et al. 2016).
The Protein Data Bank (PDB) (Berman et al. 2000) is a repository of protein
structures and these structures can now be integrated with genome-scale models 
(GEM-PRO) (Chang et al. 2013). Building multi-strain models with associated 
protein structures is another way to compare strains across a species. Using these 
tools, sequence diversity can be examined at the 3D level to see how mutations line 
up in 3D space, a level of analysis not possible at the sequence level. Furthermore, 
mutations in specific regions of the protein can be tabulated (Fig. 3c) and compared 
across strains (Mih et al. 2018). 

Finally, a multi-strain approach should prove useful for studies of the
microbiome. Multiple genome-scale models for species found in the microbiome
already exist (Magnúsdóttir et al. 2017), and GEM studies have proven effective in
studying the impact of diet (Shoaie et al. 2015) and interactions between microbes
(Shoaie et al. 2013). Expanding the multi-strain approach to study diverse strains in
these species may lead to a deeper understanding of the gut microbiome
composition. Indeed, strain-level metagenomics is coming (Scholz et al. 2016) and
expanding the study of the pangenome to the microbiome will have fruitful
applications in the near future (Fig. 3d).

Fig. 3 (continued) and BL21). Overall, the K-12 strains had a much higher correlation between
their transcriptional profiles than did BL21. Reproduced from Monk et al. (2016). (c) Expanding
analysis of sequence similarity by incorporating 3D structural information. The inclusion of
structures mapped to sequences allows the visualization of how differences in sequences manifest
in 3D space. (d) Expanding study of strains to the microbiome using metagenomics and strain-level
resolution. Panels a, c, and d reproduced from Monk et al. (2017)

In closing, we must list some caveats and risks to the multi-strain approach. First,
all of these approaches require high-quality sequence data connected to high-quality,
QC/QA-controlled data generation. The success of reliable and maximally effective future
panphenomics rests on ensuring this quality. There must be a continued effort to
ensure that sequencing projects deliver quality, not only quantity. Additionally, an
interesting question pertaining to the concept of closed pangenomes is how the
law of diminishing returns will be exhibited in these sequence deposits. Will a point be
reached where additional sequences provide no novel information? Further, the
vision of the panphenome and its implications to understanding how microbial 
pathogens impact human health will rely on both the availability of metadata and 
the deposition of strains. Metadata on these strains will only deepen the possible 
questions to be asked of both pangenomes and panphenomes. A centralized repos- 
itory of strains will also greatly expedite the experimental verification needed for 
such large computational predictions. The future of the panphenome is apparent and 
with it further explanations at the center of biological causality. 


5 Conclusions 


Significant advancements in DNA sequencing technology have led to an exponential 
increase in the number of sequenced strains. This creates a need for new ways to 
integrate and analyze this ever-increasing amount of sequence information. This 
need will only intensify as the number of sequenced strains within a species 
continues to grow exponentially. This chapter demonstrates how the pangenome is 
evolving from a theoretical concept to a queryable construct. 

In this chapter, we describe how the foundational aspects of GEMs and FBA can 
be used to predict phenotypic states for multiple strains in a species. The multi-strain 
approach has proven useful in extending this utility in a number of studies providing 
evolutionary insights as well as practical applications. As the library of available 
sequences continues to grow, the possibility of scaling these techniques to the level 
of the pangenome is becoming a reality. The result, a species-wide panphenome, 
would create a deeper level of understanding than the collection of gene content 
within the pangenome alone. 

The ability to systematically characterize an entire species' phenotypic capabil- 
ities will enhance the depth of pangenome analysis possible and pull valuable 
information inherent to genome sequences to the forefront (Fig. 4). The linkages 
and distinct features at the pangenome scale for a species offer obvious value for 
future knowledge generation, especially pertaining to human health and disease. 

Fig. 4 (panel titles: Pan-Genomics, Pan-Phenomics) The established assembly of the pangenome
through the use of genome-scale reconstructions
and corresponding computational analyses enables the calculation of panphenomes. The
panphenome increases the depth of analysis possible by providing a framework in which to 
delineate strain-specific phenotypes. This stratification based on sequence similarity allows for 
the determination of which pieces of reconstructed networks are shared among various groups of 
strains in a species. This will continue to further inform the generation of evolutionary hypotheses 


Further, the future potential applications outlined here such as inclusion of expres- 
sion, regulation, and structures into these workflows will only further advance the 
scope of genome-scale science. Genome sequences are laden with critical informa- 
tion and the tools/workflows described in this chapter provide a means for extracting 
this information into actionable knowledge. 


Abstract Genome methylation in bacteria is an area of intense interest because it
has broad implications for bacteriophage resistance, replication, genomic diversity 
via replication fidelity, response to stress, gene expression regulation, and virulence. 
Increasing interest in bacterial DNA modification is coming about with investigation 
of host/microbe interactions and the microbiome association and coevolution with 
the host organism. Since the recognition of DNA methylation being important in 
Escherichia coli and bacteriophage resistance using restriction/modification sys- 
tems, more than 43,600 restriction enzymes have been cataloged in more than 
3600 different bacteria. While DNA sequencing methods have made great advances,
there is a dearth of method advances to examine these modifications in situ.
However, the large increase in whole genome sequences has led to advances in 
defining the modification status of single genomes as well as mining new restriction 
enzymes, methyltransferases, and modification motifs. These advances provide the 
basis for the study of pan-epigenomes, population-scale comparisons among 
pangenomes to link replication fidelity and methylation status along with mutational 
analysis of mutLS. Newer DNA sequencing methods that include SMRT and 
nanopore sequencing will aid the detection of DNA modifications on the ever- 
increasing whole genome and metagenome sequences that are being produced. As 
more sequences become available, larger analyses are being done to provide insight
into the role of bacterial DNA modification in guiding bacterial survival and
physiology.


Keywords Population bacterial genomics · Whole genome sequencing · Direct
RNA sequencing · Gene expression


1 Introduction 


Bacterial cellular functions are widely impacted via epigenetic modification, includ- 
ing bacteriophage infection, metabolism, virulence, persistence, replication, and 
genome plasticity. DNA modification in bacteria is of great interest because it is 
increasingly being linked to functional regulation processes in the organism and 
disease progression in mammals (Kumar and Rao 2013). DNA methylation was 
first recognized in Escherichia coli as part of restriction/modification systems 
(RMS) that limit and regulate bacteriophage infection. RMS are ubiquitous in the
bacterial world with >43,600 recognized RM enzymes in >3600 bacteria
(http://rebase.neb.com/rebase/rebase.html) (Roberts et al. 2010). Methylation primarily
occurs at N6-adenine and C5-cytosine in many species, but N4-cytosine methylation is found
only in bacteria (Wion and Casadesus 2006; Kumar and Rao 2013). Recently, a new,
multifunctional DNA modification, phosphorothioation, that regulates the redox status
of the cell was identified (Wang et al. 2019). Subsequently, DNA and RNA methylations
were shown to play a central role in bacterial phenotypes that are not encoded in the
genome but are inherited and regulate gene expression. Post-replication modification allows cells
to rapidly adjust to local environmental conditions via gene expression changes that 
are not directly linked to genome variation yet require very dynamic shifts for 
survival and growth status. 

An emerging area of investigation is the role of the microbiome in shaping the host
epigenome. Particular interest is paid to bacterial involvement in host
cancer due to dysregulation of gene expression as cancer progresses. A comprehensive
review of the state of progress that links infectious agents to cancer and the host
epigenome proposed that chronic inflammation was involved in the dysregulation of
gene expression (Rajagopalan and Jha 2018). An intriguing hypothesis is that
bacterial metabolism in utero can have long-lasting effects by regulating epigenetic
modification of the maternal and fetal status (Romano and Rey 2018). The
complexity of the microbiome composition and metabolism leads one to expect a 
very complex system for the bacterial community to regulate the host epigenome. 
Farhana et al. (2018) reviewed the microbiome and its potential role in cancer. Of
particular interest is Helicobacter pylori, which is associated with multiple
states of disease in the progression from normal tissue to cancer, with regional and
human population differences, since it has coevolved with humans for at least 80,000 years
(Munoz-Ramirez et al. 2017). It also has a complex lifestyle in the microbial
community within a unique location in the body that forces the organism to manage
swings in pH, redox, and nutrient sources within minutes.

With the emergence of population genomics, metagenomics, and large-scale
whole-genome sequencing, the amount of available information has grown rapidly over a
short time. With over 350,000 bacterial genomes in the public domain, a new
challenge has grown in trying to conduct population epigenomics in bacteria and
then associate those changes with changes in the host that promote disease. Chen et al.
(2014) described a method for population-scale approaches; however, more robust
methods are now needed that include metagenome analysis as well.

Comparison of genomes using pangenomes and Big data approaches is
progressing to link specific genes and alleles to disease. Population genomics is
beginning to emerge (Weis et al. 2016) but it is disconnected from epigenomes and
pangenome analysis at this point. Hence, focusing on specific genes and
modifications is appropriate and provides results that can be linked to population genomics in
the future.


2 Bacterial DNA Modifications and Biological Importance 


On a biochemical level, epigenetic modification of the genome changes the
accessibility of specific gene clusters and the affinity of transcriptional regulators for
their cognate promoters. This modulation of accessibility and promoter
affinity in turn translates to changes in the bacterial response to environmental stimuli.
Because epigenetic modifiers, such as RM systems and specific methyltransferases
(MTases), are encoded on the chromosome as well as on plasmids, these
elements can be transmitted vertically through replication as well as horizontally
through gene transfer via conjugation or phage. As mentioned
above, DNA modification systems serve to identify and eliminate foreign DNA, but
these DNA modifications also serve important roles in cell cycle progression, DNA
repair, and regulation of gene expression.


2.1 Bacterial Histone-Like Proteins 


Like eukaryotic histones, bacterial histone-like proteins assist in compacting the
chromosome into a nucleoid structure (Thanbichler et al. 2005). Histone-like proteins
can be classified into four categories: histone-like proteins (HU), histone-like
nucleoid structuring proteins (H-NS), integration host factors (IHF), and factors
for inversion stimulation (FIS), further reviewed in Dorman and Deighan (2003) and
Anuchin et al. (2011). Bacteria use these proteins to organize their DNA so that it
occupies minimal space and to regulate its expression. The proteins work in a
concerted manner to bind DNA, facilitate supercoiling into a nucleoid structure, and
regulate gene expression; these mechanisms have been extensively reviewed (Dorman
and Deighan 2003; Thanbichler et al. 2005; Dorman 2013; Takahashi 2014; Grainger
2016). Throughout the cell cycle, different histone-like proteins peak in concentration
to regulate the gene sets responsible for the progression of an actively replicating
cell to a stationary phase cell, indicating that each one plays a unique role during
specific stages of growth. This cycling of histone-like proteins indicates that the
pan-epigenome changes at different phases of growth. In addition to varying with
growth phase, expression of specific histone-like proteins is also induced in response
to environmental stresses. The ability of environmental stimuli to change the
association of histone-like proteins with DNA suggests that pan-epigenetic shifts occur
when an organism adapts to its environment. Examples are evident in microbes adapted
to live in extreme environments as well as in pathogens, such as Brucella, that are
specifically adapted to live in their host. While these microbes no longer possess
genes found in related species, it was epigenetic selection that led to the refinement
of these genomes. Sustained pan-epigenetic shifts result in perpetually inactivated
genes that are subsequently lost in future generations, resulting in differentiation
between DNA modification and genotypes.

Although DNA methylation is frequently associated with RM systems and 
bacterial “immunity” against sources of foreign DNA, we are just beginning to 
understand the global impacts of DNA methylation on transcriptional regulation of 
gene expression. In addition to protein-DNA interactions affected by methylation, 
DNA modifications also regulate bacterial histone-like protein binding to DNA. 

While MTases may indirectly impact gene expression by modulating
histone-like protein-DNA interactions, MTases directly influence gene expression
through recognition motifs located in promoter regions and protein-binding
sites of genes. The methylation state of these regions modulates
the affinity of RNA polymerase and transcriptional regulators, such as
leucine-responsive regulatory protein (Lrp) and catabolite activator protein (CAP),
for specific genes, including dnaA, ppiA, yhiP, and the pap operon (Tavazoie and
Church 1998; Marinus and Casadesus 2009).

RM systems play a major role in bacterial immunity against foreign DNA.
Another component of the bacterial "immune system" was recently discovered,
termed clustered regularly interspaced short palindromic repeats/CRISPR-associated
(CRISPR/Cas). CRISPR systems are detectable in 1126 of the 2480 genomes
analyzed to date (Grissa et al. 2007). Similar to phase variable regions of the
genome, CRISPR/Cas systems are composed of short, conserved DNA repeat
sequences interspersed with stretches of variable sequence, with cas genes adjacent
to these regions. CRISPR/Cas systems recognize foreign nucleic acids, targeting
them for degradation via RNA interference effector complexes composed of Cas
proteins and CRISPR RNAs (Gasiunas et al. 2013). Though no associations between
MTases and CRISPR/Cas have been proven, Hernández-Lucas et al. determined that
Salmonella Typhi casA is under H-NS and Lrp regulation (Medina-Aparicio et al.
2011). In addition to immunity, CRISPR/Cas systems are also hypothesized to affect
DNA mismatch repair, with E. coli Cas1 involved in DNA segregation and mismatch
repair (Babu et al. 2011; Westra et al. 2012). MTases and CRISPRs share a
number of interacting partners involved in transcriptional regulation,
including Lrp and H-NS. While much remains to be learned about additional cellular
roles of these systems, it is plausible that they interact synergistically in
orchestrating essential cell processes.


2.2 DNA Modifications 


Bacteria encode numerous restriction-modification (RM) systems, which are
categorized into four main types and catalogued in REBASE (Roberts et al. 2010). An RM
system comprises a methyltransferase (MTase) that modifies DNA with a methyl group,
a restriction endonuclease (REase) that cleaves DNA, and a specificity component that
targets both enzymatic activities to a DNA recognition sequence. Briefly, Type I
systems are characterized by an oligomeric MTase and REase complex, with restriction
occurring at variable distances from the recognition site. Type II systems, the
largest category with over 16,000 MTases identified, fall into numerous subcategories
and are composed of either discrete or fused MTase and REase subunits that cleave at
or near the recognition site. Type III systems cut at a fixed distance from the
recognition sequence, with restriction enzyme activity contingent on association with
the cognate MTase. Like Type I, Type IV systems cleave at a variable distance from
the recognition site, but unlike the other three types, Type IV systems are able to
recognize and cleave hydroxymethylated and phosphorothioated DNA in addition to
methylated DNA (Vasu and Nagaraja 2013; Loenen et al. 2014).

Originally discovered as a protective mechanism against bacteriophage infection,
MTases selectively transfer the methyl group from S-adenosylmethionine (SAM) to the
nitrogen atom at position 4 of cytosine (m4C) or position 6 of adenine (m6A), or to
the fifth carbon of cytosine (m5C), within specific sequence motifs along the
bacterial genome identified by the RM system recognition domain (Wilson 1991). These
methylated sequences are resistant to digestion by the restriction enzyme and are
recognized by the RM system as a means of distinguishing self from nonself. Any phage
DNA entering the host is assessed by the RM system and digested by the RM endonuclease
if methylation is not detected by the corresponding recognition domain. To circumvent
host restriction of phage DNA, bacteriophages often introduce their own MTases
during infection. Due to the nature of RM enzyme-DNA dynamics, these MTases are
often retained by the host following bacteriophage infection and transferred to
subsequent generations, giving rise to orphan MTases lacking a reciprocal restriction
enzyme (Labrie et al. 2010; Murphy et al. 2013).

Early experiments involving manipulation of RM systems produced viable cells
with r+m+ and r-m+ phenotypes. Interestingly, an r+m- phenotype was lethal,
suggesting that in the absence of DNA methylation, restriction enzymes will digest
self-DNA, resulting in cell death (Arber 1965). In studying postsegregational
killing by RM systems, Kobayashi et al. observed more MTase molecules than REase
molecules in steady-state cells. However, dysregulation of cellular MTase and REase
levels led to increased cell death due to REase-induced double-strand breaks in the
chromosome (Ichige and Kobayashi 2005). These results further highlight a
characteristic true of all RM systems: MTases are fully functional without the cognate
restriction enzyme, whereas restriction enzyme activity is contingent on the presence
of the MTase. Easy acquisition and retention of foreign MTases (termed orphan MTases)
by host bacteria contributes to the greater diversity of MTases relative to
restriction enzymes, with possible methyltransferase sources being mobile elements
acquired through transduction or mating events (Murphy et al. 2013).


DNA Adenine Methylation (DAM) DNA adenine methylation (Dam) is the predominant
methylation found in bacteria and is accomplished by bacterial
methyltransferases (MTases). Dam MTases are widespread throughout bacterial genera,
with some MTases sharing the same recognition motif and other MTase
recognition sites being species specific, if not strain specific. The presence of
hydrophobic methyl groups on either both strands of DNA (fully methylated) or a single
strand of DNA (hemi-methylated) modulates gene expression by altering the
affinity of DNA-binding proteins for specific regions of DNA.
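
The distinction between fully methylated, hemi-methylated, and unmethylated sites can be made concrete with a small sketch. It assumes the Dam recognition motif GATC (the motif discussed later in this chapter) and takes hypothetical sets of methylated adenine positions for each strand; the function name and the input format are illustrative only and do not correspond to any particular tool.

```python
from typing import Dict, Set


def classify_gatc_sites(seq: str, top_m6a: Set[int], bottom_m6a: Set[int]) -> Dict[int, str]:
    """Label every GATC site as 'full', 'hemi', or 'unmethylated'.

    Positions are 0-based on the top strand. GATC is palindromic, so the
    top-strand adenine sits at offset +1 from the site start and the
    bottom-strand adenine pairs with the top-strand T at offset +2.
    """
    seq = seq.upper()
    states: Dict[int, str] = {}
    start = seq.find("GATC")
    while start != -1:
        top = (start + 1) in top_m6a
        bottom = (start + 2) in bottom_m6a
        if top and bottom:
            states[start] = "full"
        elif top or bottom:
            states[start] = "hemi"
        else:
            states[start] = "unmethylated"
        start = seq.find("GATC", start + 1)
    return states


# Illustrative input: three GATC sites with made-up methylation calls.
example = classify_gatc_sites("TTGATCAAGATCCCGATC", top_m6a={3, 9}, bottom_m6a={4})
print(example)
```

In this toy input the first site carries a methyl group on both strands, the second on the top strand only, and the third on neither strand.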

Survival in a niche environment such as the human body requires careful and
concerted regulation of numerous genes, ranging from stress response and nutrient
acquisition to manipulation of host processes in the case of pathogenic bacteria.
Although bacterial pathogens have coevolved with their hosts (Hongoh et al.
2005), the standard transmission cycle of some pathogens dictates that they
spend some time outside of their human host, in environments that are suboptimal
in moisture and nutrients and can contain antimicrobial compounds (Harb et al. 2000).
Transitioning from an environmental lifestyle to a host-adapted lifestyle requires a
large shift in the gene expression and protein profile of a pathogen. With the
magnitude of gene regulation needed to facilitate this lifestyle change, it is reasonable
to consider the role of epigenetics in driving these changes (Low et al. 2001).


E. coli The pap operon of E. coli encodes the pyelonephritis-associated pilus. While
pap is under methylation-mediated transcriptional control, Pap expression is also
regulated by methylation-mediated phase variation. Mechanistically, Dam competes
with transcriptional regulators, such as the global transcriptional activator Lrp, for
access to recognition domains, and the methylation state of these domains determines
the pilus ON/OFF state (Casadesus and Low 2006). Similar mechanisms governing pilus
formation and phase variation are also documented in many other bacteria including
Salmonella, S. aureus, H. influenzae, Neisseria, and H. pylori (Srikhanta et al. 2005,
2011).


Salmonella This organism is broadly modified across the genome at specific motifs
(Table 1). Within the same Salmonella virulence plasmid, H-NS represses
finP in a Dam-dependent manner while repressing traJ in a Dam-independent
manner. These observations bring to light the impact of structural differences between
the nucleoids of dam+ and dam- genomes and the outcome of these structural differences
on gene expression (Marinus and Casadesus 2009). In addition to histone-like
proteins, DNA methylation, specifically adenine methylation (Dam), is known to be
involved in regulating host colonization. PhoP, a master regulator of Salmonella
virulence, binds DNA in a Dam-dependent manner (Heithoff et al. 1999). Deletion or
overexpression of an MTase results in genome-wide changes in transcription
profiles. While Salmonella Typhimurium Dam mutants do not exhibit growth-related
deficiencies, Dam-deficient Salmonella exhibits a 10,000-fold increase in the lethal
dose required to kill 50% of a mouse population (LD50) (Low et al. 2001).
Transcriptional profiling of Dam-deficient Salmonella attributes attenuation to an
induction of spvB, along with over 35 other infection-associated genes, and a
reduction in sipABC transcripts (Garcia-Del Portillo et al. 1999).


Table 1 Epigenetic modification of selected Salmonella serotypes determined using SMRT
sequencing (Weimer, unpublished)

Serotypes: Bareilly, Heidelberg, Javiana, Typhimurium, St Paul
Isolates: SAL2881, CFSAN000318_04, CFSAN001992 73, CFSAN001921_01, SP3

Motifs:
5'-GATC-3'/3'-CTAG-5'
5'-CAGAG-3'/3'-GTCTC-5'
5'-ATGCAT-3'/3'-TACGTA-5'
5'-CAGCTG-3'/3'-GTCGAC-5'
5'-GATCAG-3'/3'-CTAGTC-5'
5'-ACCANCC-3'/3'-TGGTNGG-5'
5'-CCGAN5GTC-3'/3'-GGCTN5CAG-5'
5'-GAGN6GRTAYG-3'/3'-CTCN6CYATRC-5'
5'-GN2TAYN5RTGG-3'/3'-CN2ATRN5YACC-5'
5'-GpsAAC-3'/3'-CTTpsG-5'

Increasing shades of green indicate higher modification in each isolate with the most being 100%
and the least being 10%. No shade indicates no modification and bold base indicates location of
modification


For organisms that have a minor role in disease, or for which few whole genome
sequences are available, very little pan-epigenome information exists. Chen et al.
(2017) examined the epigenome of L. monocytogenes (Table 2) and found a complex
pattern of modification that was not observed to be associated with pathogenicity.
Virulence genes were heavily methylated, but no observable pattern emerged to uncover
how methylation was involved in virulence.


Table 2 Epigenome prevalence of modification in Listeria monocytogenes isolates involved in a
foodborne illness outbreak derived from pathogenesis association (Chen et al. 2017)

Columns: Methyltransferase specificity; Modified base; Serotype 1/2a; Serotype 4b
Motif: 5'-GATC-3'/3'-CTAG-5'
[remaining motif rows not recoverable]

Bold letters indicate the modified base. Numbers indicate the percentage of that motif modified in
the genome using SMRT sequencing. Boxes with two sets of numbers indicate the strand-specific
prevalence of methylation


DNA Cytosine Methylation (DCM) Unlike adenine methylation, which has been
functionally characterized in numerous bacterial systems, DNA cytosine methylation
(Dcm) remains relatively understudied. Best characterized in E. coli, Dcm
appears to confer resistance against restriction by the REase EcoRII (Bigger et al.
1973; Boye and Lobner-Olesen 1990). Functionally, Dcm acts as an antitoxin
against EcoRII restriction. Because Dcm and EcoRII share the same recognition
sequence, CCWGG (methylated by Dcm at the second cytosine), Dcm is able to methylate
sites that would otherwise be targeted for EcoRII restriction (Palmer and Marinus
1994). In this manner, Dcm serves a protective function against a parasitic RM system
(Takahashi et al. 2002). Dcm is also associated with mobile element rearrangements in
the E. coli genome involving bacteriophage lambda recombination and Tn3 transposition
(Korba and Hays 1982; Yang et al. 1989). On a whole genome level, evidence suggests
that Dcm is involved in transcriptional and translational regulation of ribosome
activity, decreasing the expression of ribosomal proteins during stationary phase
(Militello et al. 2012).
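
Recognition sequences such as CCWGG use IUPAC ambiguity codes (W stands for A or T), and the longer motifs in Table 1 also use count shorthand such as N5. The following minimal sketch shows one way to locate such motifs in a sequence by translating them into a regular expression. The function names are hypothetical, only the top strand is searched, and non-nucleotide annotations such as the "ps" in the phosphorothioate motif are not handled.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def expand_counts(motif: str) -> str:
    """Expand count shorthand, for example N5 becomes NNNNN."""
    return re.sub(r"([A-Za-z])(\d+)", lambda m: m.group(1) * int(m.group(2)), motif)


def find_motif(seq: str, motif: str) -> list:
    """Return 0-based start positions of every match, including overlapping ones."""
    pattern = "".join(IUPAC[base] for base in expand_counts(motif).upper())
    # The lookahead makes each match zero-width so overlapping sites are reported.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", seq.upper())]


# Illustrative sequences, not taken from any genome.
dcm_sites = find_motif("AACCAGGTTCCTGGCCGGG", "CCWGG")
gapped_sites = find_motif("TTCCGAACGTAGTCAA", "CCGAN5GTC")
print(dcm_sites, gapped_sites)
```

For non-palindromic motifs, the reverse-complement strand would need to be searched separately; that step is omitted here to keep the sketch short.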


Phosphorothioate Modification A third, recently discovered DNA modification that
naturally occurs in bacteria is phosphorothioate (PT) modification, wherein an oxygen
atom in a phosphate moiety of the DNA backbone is replaced by sulfur (Eckstein 2014).
The ability to carry out PT modifications is contingent on the presence of the dnd gene
clusters: dndABCDE, the modification component, and dndFGH, the restriction
component, although their presence can be mutually exclusive (Tong et al. 2018). PT
modification was first discovered in Streptomyces lividans, and informatics analyses of
dnd gene clusters have since revealed a wide distribution of PT modifications in
bacterial genomes (He et al. 2007; Wang et al. 2011, 2019). Abrogation of PT
modifications led to increased double-stranded DNA breaks in Salmonella and to
oxidative stress due to significant metabolic changes in Pseudomonas fluorescens (Cao
et al. 2014; Gan et al. 2014; Tong et al. 2018).


Undiscovered Modifications Next-generation sequencing techniques that incorporate
measurement of polymerase kinetics can detect structural differences in individual
nucleotides that would otherwise have been overlooked (Rhoads and Au
2015). By comparing the pattern of polymerase kinetics to previously characterized
patterns, we can informatically identify DNA modifications at the single nucleotide
level and characterize epigenetic patterns on the whole genome level (Schadt et al.
2013). The use of this technology in whole genome sequencing has also recorded
polymerase kinetics patterns that are not yet associated with a known DNA
modification (Chen et al. 2017). These data suggest that there is unprecedented diversity
in epigenetic modifications that we have yet to uncover. Epigenetic modifications
that have been characterized thus far are responsible for numerous physiological
processes including defense against foreign DNA, gene regulation, and DNA
replication and mismatch repair. Uncharacterized modifications potentially have
far-reaching implications for epigenomic regulation, for interactions
within a niche, and for interaction with the host for survival and persistence. As
additional advances are made in next-generation sequencing and RNAseq, it may
become possible to define methylation directly in situ, which is not currently possible.
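
The idea of comparing polymerase kinetics against a reference pattern can be sketched as follows. In SMRT data the kinetic signal is commonly summarized as the interpulse duration (IPD), the time between successive base incorporations, and a modified template base tends to slow the polymerase. The sketch below assumes per-position IPD observations from a native sample and from an unmodified control; the function names, the input layout, and the threshold of 2.0 are illustrative assumptions, and production pipelines use statistical models rather than a fixed cutoff.

```python
from statistics import mean
from typing import Dict, List


def ipd_ratios(native: Dict[int, List[float]], control: Dict[int, List[float]]) -> Dict[int, float]:
    """Ratio of mean native IPD to mean control IPD at each shared position."""
    ratios: Dict[int, float] = {}
    for pos, native_values in native.items():
        control_values = control.get(pos)
        if not native_values or not control_values:
            continue  # skip positions without coverage in both samples
        ratios[pos] = mean(native_values) / mean(control_values)
    return ratios


def flag_candidates(ratios: Dict[int, float], threshold: float = 2.0) -> List[int]:
    """Positions whose IPD ratio meets an (assumed) threshold."""
    return sorted(pos for pos, ratio in ratios.items() if ratio >= threshold)


# Made-up IPD values in arbitrary time units.
native = {10: [1.9, 2.1, 2.0], 11: [0.5, 0.6, 0.4], 12: [0.9, 1.1]}
control = {10: [0.5, 0.5, 0.5], 11: [0.5, 0.5, 0.5], 12: [1.0, 1.0]}
ratios = ipd_ratios(native, control)
candidates = flag_candidates(ratios)
print(ratios, candidates)
```

Positions flagged this way are only candidates; a kinetic signature that matches no characterized modification is exactly the kind of pattern described above as not yet associated with a known DNA modification.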


2.3 DNA Replication and Chromosome Sorting 


Bacteria encode proteins near their chromosomal origin of replication (oriC) that 
facilitate the timing of replication initiation and help to carry out the chromosome 
segregation during replication (Ogden et al. 1988; Boye and Lobner-Olesen 1990; 
Campbell and Kleckner 1990). Due to the time-sensitive nature of replication 
initiation, DNA replication-associated protein levels must be tightly coordinated 
with cellular replicative machinery. To accomplish this task, bacteria encode a 
higher density of GATC methylation sites around the origin of replication and utilize 
DNA methylation to modulate the affinity of replication-associated proteins to DNA. 
Methylation around oriC regulates the recruitment of replication initiation proteins 
including the initiator of replication, DnaA. Furthermore, GATC methylation motifs 
also exist in the promoter region of dnaA, allowing for transcriptional regulation of 
replication (Campbell and Kleckner 1990). During DNA replication, both copies of
the chromosome must be accurately sorted into the corresponding cell. After
replication, DNA is in a hemi-methylated state. Hemi-methylation at oriC sequesters
the origin and prevents reinitiation of DNA replication. Additionally,
global hemi-methylation of newly replicated DNA facilitates chromosome binding
to designated areas of the cell membrane such that individual chromosomes may be
accurately partitioned into each daughter cell (Ogden et al. 1988).


2.4 Mismatch Repair and Evolution


Bacterial DNA polymerases are capable of replicating DNA with high fidelity, but
replication errors still arise at a rate of 10⁻⁹ to 10⁻¹¹ errors per base pair (Drake
et al. 1998). When these replication errors arise, the cell must have a way of
identifying the correct template with which to correct the mistake. Template and newly
replicated strands of DNA are differentially methylated, with the template being
methylated and the newly replicated strand remaining unmethylated. First described in
Streptococcus pneumoniae and further characterized in E. coli, this methyl-directed
mismatch repair system was identified as MutHLS (Glickman and Radman 1980; Claverys
and Lacks 1986) (Fig. 1). MutS binds to mismatched base pairs, while the
methyl-sensitive endonuclease MutH nicks the unmethylated, newly replicated strand.
MutL recruits the DNA repair machinery to correct the mismatch. Both the loss of
MTases and the overexpression of MTases are correlated with deficient mismatch repair
due to dysregulation between methylation and DNA replication kinetics. Dam mutants
are unable to methylate the template strand, so MutHLS cannot identify the strand
containing the mutation; this leads to inaccurate mismatch repair and vertical
transmission of mutations arising from DNA replication. In this regard, the
pan-epigenome directly influences the accumulation of SNPs that arise during
replication. Due to the mobile nature of RM systems, over time the loss or acquisition
of additional MTase systems may influence the global methylation status of a genome.
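
The strand-discrimination logic can be illustrated with a deliberately simplified model. It assumes that the methyl-sensitive endonuclease can nick only an unmethylated strand, so that repair is correctly directed only when the template is methylated and the newly replicated strand is not. The function names and the three scenarios are illustrative assumptions and ignore the timing of methylation after replication.

```python
from typing import Set


def nickable_strands(template_methylated: bool, nascent_methylated: bool) -> Set[str]:
    """Strands that a methyl-sensitive nick could target in this simplified model."""
    strands: Set[str] = set()
    if not template_methylated:
        strands.add("template")
    if not nascent_methylated:
        strands.add("nascent")
    return strands


def repair_is_directed(template_methylated: bool, nascent_methylated: bool) -> bool:
    """True only when the newly replicated strand is the sole possible target."""
    return nickable_strands(template_methylated, nascent_methylated) == {"nascent"}


# Hemi-methylated DNA just after replication: repair targets the new strand.
wild_type = repair_is_directed(template_methylated=True, nascent_methylated=False)
# Methylation-deficient mutant: neither strand is marked, so the choice is ambiguous.
mtase_loss = repair_is_directed(template_methylated=False, nascent_methylated=False)
# MTase overexpression: both strands are marked before repair, so no strand is targeted.
mtase_excess = repair_is_directed(template_methylated=True, nascent_methylated=True)
print(wild_type, mtase_loss, mtase_excess)
```

Under these assumptions both the loss and the excess of methylation remove the strand signal, which is consistent with the observation that either condition is correlated with deficient mismatch repair.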


3 Epigenetic Detection Methods and Approaches 


Nucleotide modification by methylation is a prevalent feature in living organisms. In
bacteria, base methylation is part of a defense system against bacteriophage or
other foreign genetic material. The defense system works by detecting specific
nucleotide sequence motifs and cutting them with an endonuclease as a preemptive strike
against the foreign genome. Bacterial DNA is spared from cutting by the action of the
methylase. This is known as the restriction-modification system (RMS). Aside from its
defensive function, the restriction-modification system also performs genomic
regulatory functions in bacteria. Because of the large impact of the
restriction-modification system on the lifestyle of bacteria, including pathogenicity,
prokaryotic epigenomics is an emerging field, driven primarily by recent technological
advances in sequencing capability. The transformational aspect is the scalability of
methylation analysis, which has opened the door to genome-wide
methylation analysis.

What are the key considerations in doing large-scale, high-throughput epigenomics
research? Genome-wide methylation projects are shaped by
cost, ease of library construction and preparation, access to equipment or a core
facility, availability of suitable kits for library construction, and downstream
bioinformatic analysis. The level of resolution required of the epigenomic modification
data, from crude to precise, determines which technological options are appropriate for
the pipeline. These considerations, as well as the underlying technology, are
covered in the following sections.


3.1 Pre-sequencing Methods for Genome Methylation: 
LC-MS, HPLC-UV, and ELISA 


The pre-sequencing methods are generally used for basic research because of their
capability to quantify methylation at the genomic scale. While this
provides a big-picture view of methylation,
mapping the methylation sites to specific regions of the genome is not possible.
Scalability for population-scale bacterial epigenomics is limited, which has
restricted the applicability of these methods to a few niche research papers.

The key steps in the analytical workflows are DNA extraction, genomic
fragmentation, enrichment, and quantification using chromatography or mass spectrometry.
The options for genomic fragmentation are thermal, chemical, and enzymatic
hydrolysis. The resulting digested DNA monomers are enriched using size exclusion, liquid
extraction, solid phase extraction, or preparative liquid chromatography. In mass
spectrometry, analyte ions are separated by their mass-to-charge ratios, allowing
binning of the DNA monomers (Tretyakova et al. 2013).

Genome-wide methylation analysis using analytical methods, particularly HPLC-based
methods, has recently been described (Yotani et al. 2018). High-performance liquid
chromatography-ultraviolet (HPLC-UV) enables quantification and identification by
separating the different components of a sample. This is accomplished by pushing the
components in a pressurized liquid solvent through a column filled with solid adsorbent
material. Differences in how the components interact with the adsorbent result in
different flow rates, allowing separation of the components. In bacterial DNA
methylation analysis, this method is applied to quantify the separated methylated and
unmethylated deoxynucleosides.
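
Once the methylated and unmethylated deoxynucleosides are separated, global methylation is typically expressed as a percentage. A minimal sketch of that arithmetic is shown below. It assumes integrated peak areas for one methylated nucleoside and its unmethylated counterpart, plus an optional response factor obtained from calibration standards; the function name, the response factor parameter, and the numbers are illustrative assumptions rather than values from any cited study.

```python
def percent_methylated(area_methylated: float, area_unmethylated: float,
                       response_factor: float = 1.0) -> float:
    """Percentage of a nucleoside that is methylated, from chromatographic peak areas.

    response_factor rescales the methylated peak when the two species do not
    give the same detector response (assumed to come from calibration standards).
    """
    corrected = area_methylated * response_factor
    total = corrected + area_unmethylated
    if total <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return 100.0 * corrected / total


# Made-up peak areas in arbitrary units.
pct = percent_methylated(area_methylated=12.0, area_unmethylated=588.0)
print(f"{pct:.1f}% methylated")
```

The result is a single genome-wide figure, which illustrates why these methods quantify methylation but cannot map it to specific loci.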

For crude global methylation analysis, numerous commercial ELISA (enzyme-linked
immunosorbent assay) kits are available. High variance is the
primary reason for the lack of precision of ELISA kits in epigenomics, but they are
easy to use and sufficient to capture large differences in methylation. The target DNA
is immobilized on an ELISA plate, and a specific primary antibody against the methylated
nucleoside is applied, followed by a secondary antibody that can be detected using
colorimetric methods.

The requirement for specialized equipment for LC-MS and HPLC-UV has
restricted the use of these methods for genome-wide methylation analysis. While
relative quantification is possible, mapping the methylation is not, and hence
population-scale analysis is not possible. The technical challenges of the work
hinder its large-scale application.


3.2 Next-Generation Sequencing-Based Methods 


The key shortcoming of analytical methods for bacterial epigenomics is the
inability to identify methylation loci. This deficiency has predominantly been filled by
next-generation sequencing technology that can simultaneously capture sequence
and methylation data (Fig. 2). The prevailing choice for a combined sequencing and
methylation platform is single molecule real-time (SMRT) sequencing by PacBio.
Data are captured for m6A, m4C, and m5C in parallel with sequencing data, based on the
kinetics of DNA synthesis reactions. This enables genome-wide mapping of
methylated and unmethylated loci. Modified bases were not routinely included in
Sanger-based sequence analysis and posed a significant technological
challenge until the arrival of next-generation sequencing options. DNA treatment with
bisulfite converts unmodified cytosine to uracil, enabling discrimination between
modified and unmodified cytosine on various sequencing platforms.
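
The logic of bisulfite-based discrimination can be sketched in silico. The example assumes a single top-strand sequence with known positions of modified cytosines, models complete conversion of unmodified cytosine to uracil (read as T after amplification), and then recovers the modified positions by comparing the converted read with the reference. The function names and the toy sequence are illustrative assumptions; incomplete conversion and bottom-strand reads are ignored.

```python
from typing import List, Set


def bisulfite_convert(seq: str, modified_positions: Set[int]) -> str:
    """Simulate bisulfite treatment of the top strand.

    Unmodified C is converted to U and read as T; modified C is protected.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in modified_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_modified_cytosines(reference: str, converted: str) -> List[int]:
    """Positions where the reference has C and the converted read still has C."""
    return [i for i, (ref, read) in enumerate(zip(reference.upper(), converted.upper()))
            if ref == "C" and read == "C"]


reference = "ACCGTCGA"
converted = bisulfite_convert(reference, modified_positions={2})
calls = call_modified_cytosines(reference, converted)
print(converted, calls)
```

Cytosines that survive as C in the converted read are inferred to be modified, while those read as T are inferred to be unmodified.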

SMRT sequencing follows the typical workflow for next-generation sequencing,
with library construction after DNA extraction (Kong et al. 2017). Protocols for
automated PacBio 10 kb library construction have been published, which can
greatly improve the efficiency of epigenomic research. A crucial
requirement for a successful high-throughput sequencing run is high molecular weight
genomic DNA. The Agilent 2200 TapeStation Nucleic Acid System has been used to
determine the quantity and size distribution of purified genomic DNA (Kong et al.
2014), and the 260/280 and 260/230 ratios have been measured using a Nanodrop 2000
UV-vis spectrophotometer (ThermoFisher Scientific, Waltham MA). The DNA integrity
number (DIN) is a suitable tool for determining the quality of genomic DNA for
further processing (Kong et al. 2016), and methods exist for automated construction
of the sequencing library (Kong et al. 2017). SMRT sequencing relies on
restricting light illumination to the immobilized target DNA and polymerase
using a zero-mode waveguide (Rhoads and Au 2015). Signal detection of the
fluorescent dye cleaved from the nucleotide molecule is the basis for base calling.
The most technically challenging aspect of the analysis lies within the
post-sequencing bioinformatic pipeline. DNA methylation detection and quantification
are done in the PacBio SMRT analysis platform (http://www.pacb.com/devnet/code.html).
After sequencing, raw reads are trimmed to remove adapter sequences
and then aligned to a reference using BLASR (v1) (Chaisson and Tesler 2012). DNA
methylated sites are then determined using kinetic analysis of the genomic
alignment. MotifFinder clusters the methylated sites to motifs targeted by methylases.
This platform also allows discovery of novel restriction-modification genes.
Homology is inferred bioinformatically using databases like SeqWare for cloud
applications (O'Connor et al. 2010).

The development of sequencing technology has allowed large-scale analysis of
prokaryotes (Blow et al. 2016). Base-resolution methylation was captured in
unprecedented detail and scale using SMRT sequencing. Methylation
was found at about 800 different loci in this study, indicative of the precise
specificities of methylation present in bacteria. With the use of SMRT sequencing,
the known methylation repertoire was significantly increased. This highlights the key
advantage of SMRT sequencing: further resolving the recognition specificities of
methylases. Novel mechanistic epigenomic findings include Type I RM systems
cleaving DNA at large distances from their recognition sites, whereas both Type II
and Type III systems show incomplete cleavage patterns. This epigenomic feature is
problematic for digestion-based analytical methods. These RMS show a predilection
for m4C and m6A, which are readily detected by SMRT sequencing.
Another understudied aspect of methylation is the orphan methylases, which are
common in prokaryotes. This relatively understudied group includes 100 Type II
methylases. One novel discovery is potential regulatory control, suggested by the
genomic pattern associated with the orphan methylases, which are located on noncoding
sequences upstream of genes. This potential regulatory role is widely distributed
across prokaryotic organisms. In another study, deeper analyses, such as
identification and quantification of methylation motifs, correlation of methylation
motifs with methylases using REBASE (Roberts et al. 2015), and identification of
orphan methylases, were done at large scale in Listeria (Chen
et al. 2017). This study reported lineage- and clade-specific patterns of
restriction-modification systems (RMS). Type II RMS dominated, being present in 256 of
302 genomes, followed by Type I in 110 genomes, Type IV in 73, and lastly
Type III in 25 genomes. Methylation motifs were also described. These studies
highlight the large-scale applicability of sequencing-based epigenomic studies to
unravel population-scale dynamics and patterns.

On a mechanistic level, using fine-scale analysis, Fang et al. explored m6A
methylation in a Shiga toxin-producing E. coli O104:H4 isolate from the Germany
outbreak, which was predicted to produce 10 methylases that result in the m6A
modification (Fang et al. 2012). They identified a phage-encoded modification system
capable of targeting hundreds of loci within the E. coli O104:H4 isolate. This
discovery of a phage-encoded modification system associated with virulence had no prior
examples in E. coli, illustrating the immense power of sequencing platforms to
untangle epigenomic clues.


4 Conclusion and Future Direction 


The epigenomic studies relied heavily on bioinformatics to deduce motifs that were
highly enriched for modification by specific methylases. These studies discovered
novel methylase specificities, quantified methylation activity, and identified novel
enzyme activities, including activity that targets only one strand of DNA and
promiscuous activity lacking specificity. Such precision is only possible with
sequencing technology coupled with methylation detection capability. As sequencing
technologies advance, the definition of modifications will become increasingly
important in interpreting biological function. A current limitation is the mismatch
between the vast amount of whole genome sequence and the limited number of methods to
locate and quantify the modifications. A workaround is to examine the RMS enzymes,
which is interesting but not direct enough to derive biologically accurate
information. This approach also suffers from a lack of informatics methods that can be
applied on a comparative population scale, as can be done with pangenomes but not yet
with pan-methylomes for bacteria. MethBank is available for a few mammals and plants
(Li et al. 2018). The rate of bacterial genome production is only increasing. As such,
a need exists to interrogate the methylome of an organism at the speed of sequencing.
This capability is not available, which is a severe limitation in understanding
bacterial growth, survival, and association; the same is true of metagenome
interrogation. A great step forward would be to have a similar database for bacteria
with the ability to allow pangenome and pan-methylome comparisons.

The field is poised to link the bacterial methylation status with the host methylation 
composition as it relates to disease. However, the dynamic nature of the microbiome, 
gene expression, and methylation in the bacterial component is a substantial chal- 
lenge. Initial stages of examining the microbiome sequence for RMS enzymes are a 
starting point that will aid in understanding the complement of modifications that are 
possible. This work has begun for cancer progression and, to some 
degree, for single organisms, such as H. pylori, in the development of various stages of 
cancer. 

Bacterial metagenome production will increase with the expanded use of real- 
time sequencing technologies, such as nanopores. However, limitations in analysis 
and the dynamic nature of the bacterial DNA modification must be addressed to 
make substantial progress in linking it to phenotype. Future prospects of examining 
methylation are very exciting and there are many needs in the bioinformatic com- 
parative analysis, especially in pathogens associated with chronic diseases. 
Abstract The first eukaryotes emerged from their prokaryotic ancestors more than 
1.5 billion years ago and rapidly spread over the planet, first in the ocean, later on as 
land animals, plants, and fungi. Taking advantage of an expanding genome com- 
plexity and flexibility, they invaded almost all known ecological niches, adapting 
their body plan, physiology, and metabolism to new environments. This increase in 
genome complexity came along with an increase in gene repertoire, mainly from 
molecular reassortment of existing protein domains, but sometimes from the capture 
of a piece of viral genome or of a transposon sequence. With increasing sequencing 
and computing powers, it has become possible to undertake deciphering eukaryotic 
genome contents to an unprecedented scale, collecting all genes belonging to a given 
species, aiming at compiling all essential and dispensable genes making eukaryotic 
life possible. 

In this chapter, the concepts of eukaryotic core- and pangenomes will be described, as 
well as the notions of closed and open pangenomes. Among all eukaryotes presently 
sequenced, ascomycetous yeasts are arguably the best-described clade, and 
the pangenomes of Saccharomyces cerevisiae, Candida glabrata, Candida albicans, 
and Schizosaccharomyces species will be reviewed. For scientific and economic 
reasons, many plant genomes have been sequenced too, and the gene content 
of soybean, cabbage, poplar, thale cress, rice, maize, and barley will be outlined. 
Planktonic life forms, such as Emiliania huxleyi, a chromalveolate, or Micromonas 
pusilla, a green alga, will be detailed and their pangenomes pictured. Mechanisms 
generating genetic diversity, such as interspecific hybridization, whole-genome 
duplications, segmental duplications, horizontal gene transfer, and single-gene 
duplication will be depicted and exemplified. Finally, computing approaches used 
to calculate core- and pangenome contents will be briefly described, as well as 
possible future directions in eukaryotic comparative genomics. 
1 The Origin of Eukaryotes 


Respiratory-competent eukaryotic cells emerged more than 1.5 billion years ago, 
from the endosymbiosis of an alphaproteobacterium and an ancestral archaebacterium, 
probably belonging to the Asgard clade (Zaremba-Niedzwiedzka et al. 2017). 
This protoeukaryote concomitantly evolved a complex system of membrane compartments 
that would ultimately lead to the isolation of the genomic content within a 
real nucleus (eu karyon in Greek), while the degenerated alphaproteobacterium gave 
rise to the mitochondria (López-García and Moreira 2006). The subsequent acquisition 
of photosynthesis through endosymbiosis with a cyanobacterium turned this 
primitive cell into a protoalga from which all plants would eventually develop. The 
general outline of this scenario has been postulated for more than a century 
(Mereschowsky 1999; Sagan 1967), and modern-day DNA sequencing techniques 
have made it possible to identify precisely the bacteria most closely related to modern eukaryotes, 
hence representing their most probable ancestors. However, the exact order of events 
is still a matter of debate among evolution specialists. Did membranes come first, to 
isolate nucleic acid metabolism from protein and sugar metabolism? Did the mitochondria 
come first, providing a considerable source of oxidative energy to further 
develop a complex network of membranes? These two scenarios are not necessarily 
mutually exclusive, and one may also imagine that a number of different protoeukaryotes 
emerged at roughly the same time (on a geological scale) and competed with each other 
within similar ecological niches, until one lineage arose and was eventually selected 
to give rise to all eukaryotic life. 

Given the bacterial origin of nucleated cells, it was assumed that most if not all 
eukaryotic gene families would share homology to prokaryotic genes. However, the 
sequencing of an old, deep-branching eukaryote, the excavate Naegleria gruberi 
(Fig. 1), revealed that only 57% of its 4133 protein families had a clear prokaryotic 
homologue. The remaining genes showed no homology to bacterial sequences and 
therefore appear to be eukaryotic inventions. Hence, one must expect eukaryotic 
pangenomes to be significantly different from any known prokaryotic pangenome. 


2 Sequencing Eukaryotic Genomes 


Modern-day eukaryotes are estimated to represent 8,740,000 land species and 
2,210,000 ocean species, for a total of roughly 11 million, one order of magnitude 
above prokaryotes (Mora et al. 2011). Higher estimates, based on plankton sampling, 
suggest figures around 16 million oceanic eukaryotes and 60 million land 
species (de Vargas et al. 2015). Eukaryote classification is a complex problem 
rooted in nineteenth-century zoology and botany, but it has more recently gained 
much insight from whole-genome sequencing and molecular phylogeny reconstruction 
methods (Felsenstein 2004). Early eukaryotes (or old eukaryotes), such as fungi, 
unicellular green algae, excavata (one of the most basal lineages), amoebozoa, and 
chromalveolata probably diverged between 1.2 and 1.45 billion years ago (Embley 
and Martin 2006). Younger eukaryotes, like vertebrates, emerged 450 million years 
ago (Erwin et al. 2011), whereas Homo sapiens is still in evolutionary infancy with 
an estimated date of divergence from chimpanzee around 6.5 million years ago 
(Green et al. 2010) (Fig. 1). 

The ascomycete Saccharomyces cerevisiae was the first eukaryote whose nuclear 
genome was totally sequenced, more than 20 years ago (Goffeau et al. 1996). In the 
1990s, it took the efforts of 633 scientists from more than 100 laboratories over 
8 years to complete it (Goffeau et al. 1997). In the modern genomic era, sequencing 
is fast and cheap, and whole eukaryotic genomes can be deciphered at a scale and pace 
unprecedented in human history. At the present time, 707 different eukaryote 
species, comprising 54 unicellular animals (Protozoa) or algae, 300 metazoans 
(multicellular animals), 137 plants, and 216 fungi, have had their genome sequenced to 
various levels of completion and assembly. Indeed, the pace at which eukaryotes 
are being sequenced is so high that the aforementioned figures will be 
completely outdated by the time this book is published. Remarkably, one of the most 
ambitious current genome projects aims to sequence all eukaryotic life present 
on planet Earth, and the cost of such a project would be similar to what was spent to 
sequence the first human genome alone (Pennisi 2017). Some of the most representative 
eukaryote species whose genomes were completely sequenced are 
shown in Fig. 1, on the evolutionary branch they belong to, along with their 
estimated geological period of appearance based on molecular clocks. 


Fig. 1 (continued) group (or clade) that survived to present day. Branch lengths are arbitrary. When 
more than one organism was sequenced in a given clade, only one was shown (for example, among 
all sequenced bird genomes only the paradigmatic Gallus gallus species was represented). Vertical 
dotted lines indicate speciation time from the most recent common ancestor, calculated from 
molecular clocks. For example, Actinopterygians (bony fish) separated from other vertebrates 
approximately 450 million years ago. Note that Precambrian radiation dates were only tentatively 
attributed, given the large uncertainties associated with ancient eukaryotes. Circled numbers 
represent whole-genome duplications detected by sequencing. The constriction between Archosaurians 
and Aves represents the Archaeopteryx, the ancestor of all modern birds (Hillier et al. 2004). 
The smaller arrow between Archosauria and Crocodilia represents the dinosaurian mass extinction, 
66 million years ago, whose only survivors were the ancestors of modern-day crocodiles 
(Brugger et al. 2017; Renne et al. 2015). Red circled species were used to define core- and 
pangenomes and are more extensively described in the text 
3 The 1000 Genome Projects 


One of the most remarkable aspects of modern-day genomics is the ambition to 
describe a large number of individuals (usually in the range of thousands) belonging 
to the same monophyletic group (or clade). When the first eukaryotic genome 
sequences were completed, it became apparent that one genome would not be 
sufficient to describe the whole species. Several programs subsequently started, 
aiming at sequencing a large number of individuals belonging to the same species 
and comparing them to the first genome, usually called “reference genome” because 
its state of completion and annotation was often more advanced. Several of these 
projects have been completed over the last few years: 1011 S. cerevisiae genomes 
(Peter et al. 2018), 1135 Arabidopsis thaliana genomes (The 1001 Genomes Con- 
sortium 2016), 2504 followed by 10,545 human genomes (Telenti et al. 2016; The 
1000 Genomes Project Consortium 2015), and 1483 rice genomes (Yao et al. 2015) 
have already been sequenced, but complete analyses of gene content and core- and 
pangenome calculations are not always published. Even more ambitious endeavors 
are planned: the 10,000 plant genome project led by the Chinese BGI¹ aims at 
sequencing one representative plant from every major clade (Normile et al. 2017); 
the same institute launched in 2015 the 10,000 bird genome project, in an attempt to 
sequence every one of the 10,500 living bird species (Zhang 2015). The i5K 
initiative is planning to sequence 5000 arthropod genomes (i5K Consortium 2013) 
and the Genome 10K project intends to sequence 10,000 vertebrate genomes 
(Genome 10K Community of Scientists 2009). All these projects, and many others 
to come, will contribute to unraveling the complete set of genes used by eukaryotic 
life forms on Earth. With this wealth of data at hand, assuming it will not be too 
overwhelming for available data storage and computing power, essential questions 
should find their answers. What are the core genes shared by all eukaryotic species? 
How many different versions of the same gene (alleles) can be found? How many 
variable or dispensable genes can be detected in a given species? What is the size of a 
species pangenome, of a clade pangenome, of the eukaryotic pangenome itself? 


4 Defining Eukaryotic Pangenomes: Open or Closed? 


The very notion of pangenome was coined by Hervé Tettelin and colleagues in a 2005 
seminal article, describing sequencing and genome analysis of eight strains of 
Streptococcus agalactiae. Despite a high degree of synteny² between isolates, the 
authors detected 69 genomic islands that were absent in at least one genome, some 
characterized by an atypical nucleotide compositional bias, suggestive of a possible 
acquisition by horizontal transfer. They showed that the number of shared genes in all 


¹ Beijing Genomics Institute, by far the largest sequencing center in the world. 
² Synteny: gene order along a chromosome. 
species decreased at each addition of a new genome, reaching the minimal number of 
1806 genes. On the contrary, each genome addition increased the number of variable 
genes, those that are absent in one or more strains. They proposed that a bacterial 
species may be defined by a set of genes present in all strains (core-genome) and by a 
dispensable, or variable, set of genes, composed of those present in at least one 
strain but absent from at least one other. The addition of these variable genes to the 
core-genome would make what was called the “pangenome” (from the Greek word pan 
(παν), meaning “whole”) (Tettelin et al. 2005). Mathematical modeling showed that 
the pangenome measurement followed Heaps' law, an empirical law used in 
information retrieval, in which as more and more books are read, the number of 
different words grows as a power law of the total number of books read. The functional 
form of the power law depends on two parameters: the exponent α and a proportionality 
constant κ. Practically, the number of new genes discovered after each new 
genome sequence will be $n = \kappa N^{-\alpha}$, in which κ is a constant, N is the number of 
genomes sequenced, and α > 0. For α > 1, the pangenome size approaches a plateau 
as more and more genomes are sequenced: the pangenome is “closed” (Fig. 2a). On 
the other hand, for 0 < α < 1, the pangenome size keeps increasing without bound as new genomes 
are added and the pangenome is “open” (Fig. 2b) (Tettelin et al. 2008). 
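
As a rough numerical illustration of the power law above, the following Python sketch sums the new genes contributed by each successive genome. It is not the fitting procedure of Tettelin and colleagues: the function name is hypothetical, and the value κ = 400 is simply borrowed from the example of Fig. 2.

```python
def pangenome_growth(kappa, alpha, n_genomes):
    """Cumulative number of genes when genome N adds kappa * N**(-alpha) new genes.

    kappa and alpha are the two parameters of the power law; the values used
    below are illustrative assumptions, not measured data.
    """
    sizes = []
    total = 0.0
    for n in range(1, n_genomes + 1):
        total += kappa * n ** (-alpha)
        sizes.append(total)
    return sizes


closed = pangenome_growth(kappa=400, alpha=2.0, n_genomes=1000)
open_ = pangenome_growth(kappa=400, alpha=0.5, n_genomes=1000)

# Genes gained between the 500th and the 1000th genome in each regime
closed_gain = closed[-1] - closed[499]
open_gain = open_[-1] - open_[499]
print(f"closed: +{closed_gain:.2f} genes, open: +{open_gain:.0f} genes")
```

With α = 2 the second half of the sampling adds less than one gene in total, whereas with α = 0.5 it still adds several thousand, which is the practical difference between a closed and an open pangenome.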

Among sequenced bacterial species, some exhibit a closed pangenome, for 
example Staphylococcus aureus (α = 1.84), Streptococcus pyogenes (α = 1.88), 
Ureaplasma urealyticum (α = 2.5) or the extreme case of Bacillus anthracis 
(α = 5.6). Others display an open pangenome, like Bacillus cereus (α = 0.65) or 
the cyanobacterium Prochlorococcus marinus (α = 0.80). Note that when α is equal 
or very close to 1, the pangenome is still open, but the rate of acquisition of new 
genes is very slow. This is the case of Escherichia coli (α = 1.04), Streptococcus 
agalactiae (α = 1.05), or Streptococcus pneumoniae (α = 0.98) (Tettelin et al. 
2008). 
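
One simple way to obtain an exponent such as those quoted above is a straight-line fit in log-log space, since the power law becomes log n = log κ − α log N. The sketch below assumes this ordinary least-squares approach, which is not necessarily the regression used in the cited study; the function names and the synthetic input are illustrative only.

```python
import math


def fit_power_law(new_genes):
    """Least-squares fit of log(n) = log(kappa) - alpha * log(N).

    new_genes[i] is the number of new genes found when genome i + 1 is added
    (all values must be > 0). Returns (kappa, alpha).
    """
    xs = [math.log(i + 1) for i in range(len(new_genes))]
    ys = [math.log(g) for g in new_genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    kappa = math.exp(mean_y - slope * mean_x)
    return kappa, -slope


def classify(alpha):
    """Apply the rule of the text: closed if alpha > 1, open otherwise."""
    return "closed" if alpha > 1 else "open"


# Synthetic counts generated from the law itself, for illustration only
synthetic = [400 * n ** -0.65 for n in range(1, 21)]
kappa, alpha = fit_power_law(synthetic)
print(round(kappa, 1), round(alpha, 2), classify(alpha))
```

On real data the counts are noisy and depend on the order in which genomes are added, so such a fit would normally be applied to averages over many random orderings.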


5 Yeast Pangenomes 


5.1 Saccharomyces cerevisiae 


Historically, budding yeast was the first eukaryote whose genome was completely 
sequenced (Goffeau et al. 1996). A British collaborative work in which 
70 S. cerevisiae and S. paradoxus isolates were sequenced to low coverage showed 
that S. cerevisiae strains were less variable than S. paradoxus strains. The worldwide 
budding yeast population structure consisted of a few geographically isolated 
lineages and of several mosaic genomes, underlining the possibility that humans 
played a major role in producing these variations by transporting and selecting yeast 
strains (Liti et al. 2009). Following this pioneering work, a collaborative effort of two 
French laboratories and the Genoscope led to the sequencing of 1011 S. cerevisiae 
isolates, collected worldwide, from domesticated, wild, or human origin (mainly 
clinical). This sequencing effort made it possible to determine that Chinese and Taiwanese 
Fig. 2 Open versus closed pangenomes. (a) Closed pangenome. In this example, the number of 
new genes = $400 \times (\text{Nbr genomes})^{-\alpha}$, with α = 2. The number of new genes revealed by each new 
genome sequence rapidly decreases and the pangenome size reaches a plateau. (b) Open 
pangenome. The number of new genes = $400 \times (\text{Nbr genomes})^{-\alpha}$, with α = 0.5. The number of 
new genes revealed by each new genome sequence decreases only slowly and the pangenome size 
steadily increases 
strains were closer to Saccharomyces paradoxus and to the root of the Saccharomyces 
sensu stricto than strains from any other origin, strongly supporting a single 
out-of-China origin for S. cerevisiae, which subsequently spread all over the planet. Using 
de novo assembly and a specific detection pipeline, it could be determined that the 
yeast core-genome contained 4940 Open Reading Frames (ORFs) whereas 2856 
ORFs were variable within the population, for a total of 7796 ORFs constituting the 
pangenome (Peter et al. 2018) (Table 1). Core ORFs were mostly found in one copy 
per haploid genome, while ca. 20% of variable ORFs were absent or present in more 
than one copy. The authors subsequently looked at the origin of these variable ORFs 
and classified them in three different groups, based on their phylogeny: ORFs with 
their closest ortholog in another S. cerevisiae strain and consistent with genome 
phylogeny were considered as being ancestral acquisitions; ORFs with their best 
ortholog in another Saccharomyces species were considered to be introgressions; and 
finally ORFs more related to another yeast species outside the Saccharomyces 
complex were treated as horizontal gene transfers (HGT) (Fig. 3a). Using these 
definitions, 1380 variable ORFs were assigned to an ancestral inheritance, 
913 were designated as introgressions, and 183 were likely to be the result of HGT 
events from distantly related yeast species. Half of these HGT ORFs could be traced to 
Torulaspora or Zygosaccharomyces species. Given that these yeasts share similar 
fermentative environmental niches, it is likely that such physical proximity favored 
frequent transfer of genetic material between these species. In six cases, large HGT 
events (38–165 kb) were identified, but most isolates retained only mosaics of small 
segments, suggesting that the large ancestral HGT underwent several rounds of 
successive deletions leading to the complex patterns observed today. Among the 
913 introgressions, 97% were unambiguously acquired from S. paradoxus, all 
S. cerevisiae isolates carrying at least one S. paradoxus ORF, suggesting continuous 
gene flow between these two yeast species. This is in good agreement with an earlier 
work using microarrays to genotype Saccharomyces strains of different origins, in 
which most introgressions detected in S. cerevisiae came from S. paradoxus (Dunn 
et al. 2012). Finally, two-thirds of ancestral acquisitions were present in at least half 
the yeast isolates, suggesting that they segregated in most strains since the time of 
their acquisition (Fig. 3b). 
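
The three-way classification of variable ORFs described above can be sketched as a simple decision rule on the species of the best ortholog. This is only a minimal illustration under stated assumptions: the function name and the input format (a binomial species name for the best hit) are hypothetical, and the check of consistency with the genome phylogeny applied by the authors is not reproduced here.

```python
from collections import Counter


def classify_variable_orf(best_hit_species):
    """Assign a variable ORF to a category from the species of its best ortholog.

    best_hit_species is a binomial name such as "Saccharomyces paradoxus".
    Simplification: the phylogenetic consistency test is omitted.
    """
    genus, _, epithet = best_hit_species.partition(" ")
    if genus == "Saccharomyces" and epithet == "cerevisiae":
        return "ancestral"
    if genus == "Saccharomyces":
        return "introgression"
    return "HGT"


# Hypothetical best hits for four variable ORFs
best_hits = [
    "Saccharomyces cerevisiae",
    "Saccharomyces paradoxus",
    "Torulaspora delbrueckii",
    "Zygosaccharomyces bailii",
]
counts = Counter(classify_variable_orf(hit) for hit in best_hits)
print(dict(counts))
```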

The core and variable ORFs of the S288C reference strain were analyzed more 
thoroughly for gene function. Out of 6081 ORFs, 1144 were identified as 
variable. The distribution of these ORFs was found to be skewed toward 
subtelomeric regions, which have long been known to be highly polymorphic 
among yeast strains and species (Fabre et al. 2005). Functions of variable 
ORFs were strongly enriched for cell-wall and membrane components, cell-cell 
interactions, and secondary metabolism. Finally, core-genome ORFs were found to 
exhibit lower levels of loss-of-function mutations than variable 
ORFs, as well as a lower dN/dS ratio (nonsynonymous over synonymous substitutions), 
showing that the former were more constrained than the latter. 
Fig. 3 Variable ORFs of the S. cerevisiae pangenome. (a) Phylogenetic origin of variable ORFs. 
An ORF was considered an ancestral acquisition when its best match was an S. cerevisiae 
ORF (blue arrow), an introgression when its best homolog was in another Saccharomyces 
species (purple arrow), or a horizontal gene transfer (HGT, red arrow) when its best homolog was in 
another yeast species. (b) Distribution of variable ORFs. The number of isolates is indicated on 
the X-axis and the number of variable ORFs in each category is represented on the Y-axis 


5.2 Candida glabrata 


C. glabrata is an opportunistic pathogen responsible for candidiasis and bloodstream 
infections in immunocompromised patients (Bodey et al. 2002). It is the second most common cause 
of nosocomial infections, after Candida albicans, and a growing concern in public 
health, due to its resistance to azole antifungal drugs (Pfaller and Diekema 2004). 
Despite its genus name, its genome is closer to S. cerevisiae than to C. albicans. It 
belongs to the Nakaseomyces clade that also includes Candida nivariensis and 
Candida bracarensis, two emerging pathogens, as well as Nakaseomyces delphensis, 
Nakaseomyces bacillisporus, and Candida castellii, three nonpathogenic species 
(Fig. 4). Comparison of orthologous protein conservation shows that this clade is 
as distant from the Saccharomyces clade as man is from fish (Dujon 2006). 
Hence, the distance between orthologous proteins belonging to these two monophyletic 
groups is similar to the distance covered by vertebrate proteins since the 
actinopterygian radiation, some 450 million years ago³ (Fig. 1). C. glabrata exhibits 
frequent chromosome polymorphisms among different isolates, due to translocations, 
copy number variations (CNV), gene tandem amplifications (Muller et al. 2009), 
formation of neo-chromosomes (Polakova et al. 2009), and the presence of many 
large tandem repeats known as megasatellites (Rolland et al. 2010; Thierry et al. 
2008, 2009). The five aforementioned pathogenic and nonpathogenic Nakaseomyces 
species were sequenced to high coverage and their sequence was compared to the 
C. glabrata CBS138 reference strain (Dujon et al. 2004). Protein contents range from 
4875 for C. castellii to 5315 for C. bracarensis, figures significantly lower than the 
5886 S. cerevisiae proteins (Gabaldon et al. 2013). Among gene losses in 
Nakaseomyces, four entire multigene families (PHO, SNZ, SNO, and PAU) were 
absent in all species or represented by only one member in C. castellii or 
N. bacillisporus. These genes are involved in phosphate metabolism (PHO), in 
nutrient limitation response (SNZ and SNO), or in alcoholic fermentation (PAU). 
The loss of BNA genes, functioning in de novo synthesis of nicotinic acid, probably 
results from the yeast adaptation to its human host, since colonization of the urinary 
tract occurs through induction of adhesin genes, upregulated in nicotinic acid-poor 
medium, such as urine (Domergue et al. 2005). The C. glabrata genome contains a 
large number of genes that are absent from S. cerevisiae and specifically involved in 
adhesion and virulence. The EPA gene family, a family of glycosyl-phosphatidylinositol 
cell-wall genes completely absent from S. cerevisiae, was represented by 18 members 
in the C. glabrata reference strain (CBS138), and seven additional genes were present 
in the BG2 strain, widely used in adhesion studies (Cormack et al. 1999). Remarkably, 
the two other pathogenic species, C. bracarensis and C. nivariensis, contained 
respectively 12 and 9 members of the EPA family, whereas the nonpathogenic 
N. delphensis and C. castellii harbored respectively one and three copies and 
N. bacillisporus presented only one distant homologue. In addition, the C. glabrata 
genome contained 44 genes comprising internal repeats, whose motifs were 
135-300 nt long, tandemly repeated 3-30 times in frame (Thierry et al. 2008). 
These megasatellites encode many serine and threonine residues and genes harboring 
these tandem repeats were proposed to encode cell-wall glycoproteins and to be 
involved in cellular adhesion (Thierry et al. 2009). Phylogenetic studies of 21 fungal 
genomes showed that these megasatellites were uniquely found in C. glabrata, but 
their presence among other members of the Nakaseomyces has not been tested yet 
(Tekaia et al. 2013). 


³ This does not mean that Saccharomyces and Nakaseomyces diverged 450 million years ago, 
because there is no reliable molecular clock for yeasts. 


264 G.-F. Richard 


[Fig. 4 image. Left: Dikarya tree (Ascomycota) with Saccharomycotina (S. cerevisiae, Candida dubliniensis*, Candida albicans*, Yarrowia lipolytica) and Taphrinomycotina (Schizosaccharomyces octosporus, S. pombe, S. japonicus, S. cryophilus), rooted at the Dikarya ancestor. Right: amino acid conservation (%) scale with Homo sapiens, Mus musculus, Gallus gallus, Tetraodon nigroviridis, and Ciona intestinalis. * human pathogen] 

Fig. 4 Yeast pangenomes of the Dikarya tree. On the left, the figure shows some of the yeast 
species whose genomes were completely sequenced, arranged by clade. Branch lengths are arbitrary 
and do not reflect evolutive distances. On the right, amino acid conservation of orthologous proteins 
between yeast and between animal species are indicated (adapted from Dujon 2006) 


In a very recent study, 33 isolates of C. glabrata of different geographical origins 
were fully sequenced and compared to the CBS138 reference strain (Carreté et al. 
2018). Altogether, 108 genes were deleted or duplicated in these strains, half of them 
encoding glycosylphosphatidylinositol-anchored adhesin homologues, showing the 
extensive variability of this gene family within this clade. The core-genome 
contained 3603 proteins, significantly fewer than for S. cerevisiae (see above). On 
the contrary, the number of variable ORFs was higher than in budding yeast, since 
302–580 predicted genes (mean: 342) were found to be unique to each isolate, for a 
total of 9915 strain-specific genes among the 29 strains considered.⁴ This figure may be 
partially overestimated, due to automated annotations or clustering artifacts, but 
from these data one may infer that the C. glabrata pangenome covers 13,000–14,000 
genes, almost twice as many as the S. cerevisiae pangenome. 
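
The order of magnitude quoted above can be checked with simple arithmetic, assuming that the estimate is obtained by adding the strain-specific genes to the core-genome. This assumption is ours; it ignores genes shared by several but not all isolates, and the variable names are illustrative.

```python
# Figures quoted in the text
core_proteins = 3603            # C. glabrata core-genome
strain_specific_genes = 9915    # genes unique to one isolate
strains_considered = 29
s_cerevisiae_pangenome = 7796   # ORFs (Peter et al. 2018)

mean_unique_per_strain = strain_specific_genes / strains_considered
pangenome_estimate = core_proteins + strain_specific_genes
ratio = pangenome_estimate / s_cerevisiae_pangenome

print(round(mean_unique_per_strain), pangenome_estimate, round(ratio, 2))
```

The sum falls within the 13,000–14,000 range given in the text, and the mean number of unique genes per strain agrees with the reported value of 342.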

In conclusion, yeasts of the C. glabrata clade contain significantly fewer genes 
than S. cerevisiae, with specific gains and losses as compared to their distant cousin. 
However, gene content is highly variable among Nakaseomyces and the C. glabrata 
pangenome size is larger than the S. cerevisiae pangenome, although further ana- 
lyses are needed to narrow down these numbers. 


5.3 Schizosaccharomyces Genomes 


Fission yeasts are very distant relatives of S. cerevisiae and the Taphrinomycotina 
clade comprises only four known species: Schizosaccharomyces japonicus, 
Schizosaccharomyces cryophilus, Schizosaccharomyces octosporus, and the model 
yeast Schizosaccharomyces pombe. They form a basal branch of the Dikarya⁵ tree 
(Fig. 4) and exhibit very distinct life history and metabolism as compared to 
Saccharomycotina. In many respects, S. pombe is actually closer to metazoans 
than to budding yeasts: among the more prominent features, large repetitive centromeres, 
heterochromatin histone methylation, heterochromatin proteins, RNA interference, 
telomere-binding proteins, cell-cycle control, the mitochondrial translation 
code, splicing and spliceosome components are more similar to those of metazoans. In 
addition, core orthologous genes in S. pombe are closer to metazoan genes than to 
those of other Ascomycota. Phylogeny reconstruction of the clade using high coverage 
sequence of the four Schizosaccharomyces species and 440 single-copy core 
orthologues surprisingly revealed that S. pombe and S. japonicus were as distant from 
each other (55% average amino acid identity) as man and Ciona intestinalis, a 
urochordate (Fig. 1) (Rhind et al. 2011). The two other species, S. octosporus and 
S. cryophilus, were closer to each other (85% amino acid identity). Retrotransposons 
are numerous in S. japonicus and sequence divergence of their reverse transcriptase 
suggests that they predate the last ancestor of the Ascomycota. However, transposons 
were dramatically lost in the three other species, since S. pombe harbors two related 
retrotransposons, S. cryophilus contains only one and S. octosporus only has 
sequence relics of reverse transcriptase sequences. This loss was accompanied by 
a reorganization of centromere architecture, replacing the numerous transposons 
found at S. japonicus centromeres by other kinds of repeated sequences unrelated 
to transposons and specific to each of the other three species. 


⁴ Four isolates were excluded from this analysis because of low-quality assembly. 
⁵ Ascomycota and Basidiomycota together form the Dikarya. 
Out of ~5000 coding genes in fission yeasts, 4218 (84%) were identified as 
single-copy orthologues common to all four species. For some gene families, the 
level of conservation was even higher: 93% of protein kinases were common and, 
more surprisingly, 81% of introns (2901 out of 3601) were identical across the clade. 
Most gene gains were species- or clade-specific genes not found in another yeast 
species, whereas gene losses included the glyoxylate cycle, glycogen biosynthesis, 
phosphoenolpyruvate carboxykinase, several ADH genes, and transcriptional 
regulators of glucose repression, all these changes reflecting the inability of fission 
yeast to use ethanol as a carbon source, although it produces it by fermentation. 
Hence, despite large evolutionary distances between conserved orthologous proteins, 
Schizosaccharomyces show a remarkably stable gene content, supporting a 
pangenome size only 10–20% larger than its core-genome. 
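
In set terms, the core-genome is the intersection of the gene families found in each genome and the pangenome is their union. The following sketch illustrates this on toy data: the family identifiers and their distribution across species are invented, and real analyses first require clustering genes into orthologous families, a step not shown here.

```python
def core_and_pan(gene_sets):
    """Return (core, pan) from a mapping of genome name -> set of gene families."""
    sets = list(gene_sets.values())
    core = set.intersection(*sets)   # families present in every genome
    pan = set.union(*sets)           # families present in at least one genome
    return core, pan


# Toy presence data (hypothetical family identifiers)
toy = {
    "S. pombe": {"f1", "f2", "f3", "f4", "f5", "f6"},
    "S. japonicus": {"f1", "f2", "f3", "f4", "f5", "f7"},
    "S. octosporus": {"f1", "f2", "f3", "f4", "f5"},
    "S. cryophilus": {"f1", "f2", "f3", "f4", "f5"},
}
core, pan = core_and_pan(toy)
excess = (len(pan) - len(core)) / len(core)
print(len(core), len(pan), f"{excess:.0%}")
```

The last quantity, the relative excess of the pangenome over the core-genome, is the kind of figure summarized in the text as "10–20% larger".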


5.4 Candida albicans 


Candida albicans is another opportunistic pathogen, responsible for mucosal and 
systemic infections in immunocompromised patients. It is also a commensal of the 
gastrointestinal tract. Natural isolates of C. albicans are diploid and under specific 
conditions they are able to mate, resulting in tetraploid cells subsequently shifting to 
diploidy via random chromosome loss (Bennett and Johnson 2003). The nuclear 
genome of SC5314, a standard laboratory strain widely used in molecular analyses, 
was published in 2004. It revealed a high level of single-nucleotide polymorphisms 
(SNP) between both homologues, representing 90% of all detected polymorphisms, 
with an average frequency of one SNP in 237 bases. Heterozygosity was not 
homogeneous, since several chromosomes were interrupted by large regions of 
homozygosity (Jones et al. 2004). After that initial study, 21 clinical isolates of 
C. albicans, characterized by different phenotypic profiles, were also completely 
sequenced. Single-nucleotide polymorphisms were very limited among the isolates, 
being one order of magnitude lower than what was commonly found among 
C. glabrata strains (Gabaldón and Fairhead 2019). The gene content of these isolates 
was very similar to that of the SC5314 reference strain, since most of its genes were 
present in all isolates (6069 genes out of 6189, or 98%, on average), with few 
variable genes (Table 1). Genes exhibiting the most variable number of copies were 
retroelements as well as the subtelomeric TLO gene family. The position and number 
of TLO genes varied from 10 to 15 among isolates, indicative of a high level of 
plasticity (Hirakawa et al. 2015). More recently, the Candida dubliniensis genome, 
another opportunistic pathogen, less virulent than C. albicans, was sequenced. 
Except for translocations and chromosomal rearrangements that may be expected 
between two yeast species, both gene contents were found to be surprisingly similar. 
Out of 5569 orthologues, 5363 (96.3%) were more than 80% identical at the nucle- 
otide level, and synteny was conserved for 98% of genes (Jackson et al. 2009). The 
search for species-specific genes identified 111 ORFs in C. dubliniensis and 191 in 
C. albicans. However, most of these variable ORFs corresponded to transposable 
elements. When these were filtered out, the real number of species-specific genes 
dropped to 29 and 168, respectively. Among those, the TLO gene family (12 members 
in C. albicans) was specifically expanded in this species, since only two copies were 
detected in C. dubliniensis and species-specific copies were monophyletic, 
supporting an independent expansion in C. albicans. On the contrary, the IFA gene 
family (13 members in C. albicans) underwent massive gene loss in C. dubliniensis, 
since several gene relics at various stages of decay were identified in this yeast 
species. In conclusion, in the present state of analysis, it appears that the core- 
genome common to C. albicans and C. dubliniensis probably approximates 5400 
genes and that their pangenome may be predicted to be slightly larger, possibly 
around 6200 genes. 


6 Plant Pangenomes 


6.1 Soybean Genomes 


Glycine max is the cultivated soybean variety, whose genome was published in 2010 
(Schmutz et al. 2010). It was domesticated 5500 years ago and has been under 
intensive selection by human populations for yield increase. It diverged from the 
wild variety, Glycine soja, 800,000 years ago, well before its domestication. There- 
fore, natural selection contributed to differentiation of the two subspecies well before 
human selection started. In order to estimate the genetic diversity between domesticated 
and wild soybean species, the genomes of seven Glycine soja isolates from 
south-east Asia were sequenced and compared to each other and to G. max (Li et al. 
2014). Gene number ranged from 54,256 to 57,631, depending on the isolate and 
hundreds of genes were identified as gained or lost as compared to domesticated 
soybean. The G. soja core-genome contained 28,716 genes, while 30,364 variable 
genes were identified. Most of them (58%) were shared by two to six out of seven 
samples, whereas 12,916 (42%) were uniquely found in one of the seven isolates. 
The pangenome therefore contained 59,080 genes and covered 986.3 Mb of 
sequence. Its size increased with each new isolate, but it did not reach an asymptote, 
suggesting that adding new isolates would increase pangenome size (Fig. 2). Inter- 
estingly, dispensable genes exhibited more sequence variability than core genes. 
SNP frequency was at 2.67 sites per kilobase for variable genes, whereas it was 
significantly higher for core genes (4.12 sites per kilobase), and a similar bias was 
found for indels. Biological processes enriched in dispensable genes include specific 
metabolic processes, antioxidant activity, and structural molecule activity. These 
genes were also less conserved than core genes since 58% could not be assigned to a 
functional annotation, as compared to only 34% of the core genes. Lineage-specific 
genes include 11 genes implicated in effector-triggered immunity, acting as patho- 
gen detectors, reflecting adaptation to various biotic stresses. 
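
The statement that pangenome size "did not reach an asymptote" refers to a gene accumulation curve. The following minimal sketch shows how such a curve can be computed from gene presence/absence data. It is an illustration only, not the procedure used by Li et al. (2014); the function name `accumulation_curves`, its parameters, and the toy isolates are hypothetical.

```python
import random

def accumulation_curves(genomes, n_orders=200, seed=0):
    """Mean pangenome and core-genome sizes as genomes are added one by one.

    `genomes` maps an isolate name to the set of gene identifiers it carries.
    Returns two lists (pan, core) with one mean value per number of genomes added.
    """
    rng = random.Random(seed)
    names = list(genomes)
    pan_totals = [0] * len(names)
    core_totals = [0] * len(names)
    for _ in range(n_orders):
        rng.shuffle(names)  # random addition order, to average out order effects
        pan, core = set(), None
        for i, name in enumerate(names):
            pan |= genomes[name]  # pangenome: union of all genes seen so far
            # core-genome: intersection of all genomes seen so far
            core = set(genomes[name]) if core is None else core & genomes[name]
            pan_totals[i] += len(pan)
            core_totals[i] += len(core)
    pan_curve = [t / n_orders for t in pan_totals]
    core_curve = [t / n_orders for t in core_totals]
    return pan_curve, core_curve

# Hypothetical toy data: three genes shared by all isolates, plus variable genes.
toy = {
    "isolate_1": {"g1", "g2", "g3", "a1", "a2"},
    "isolate_2": {"g1", "g2", "g3", "b1"},
    "isolate_3": {"g1", "g2", "g3", "a1", "c1", "c2"},
    "isolate_4": {"g1", "g2", "g3", "d1"},
}
pan_curve, core_curve = accumulation_curves(toy)
print(pan_curve, core_curve)
```

A pangenome curve that keeps rising as isolates are added is the signature of an open pangenome, whereas a curve that flattens suggests a pangenome close to completion.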

The domesticated soybean genome contains 1794 genes involved in acyl lipid 
metabolism, illustrating the effect of its intense selection for oil and fatty acid 
production. Among those, 32 exhibited CNV when compared to Glycine soja, 
252 contained SNPs or indels and 21 showed high dN/dS ratios, suggestive of 
their possible positive selection in Glycine max. 

In conclusion, the G. soja pangenome was found to be twice as large as its core- 
genome, and its comparison with the domesticated G. max species revealed the 
effect of human selection on this widely cultivated crop. 


6.2 Rice Genomes 


Rice (Oryza sativa L.) is one of the most important crops in the world, feeding half 
the world population. The genome sequence of this monocotyledon was published in 
2005 (International Rice Genome Sequencing Project 2005), although draft 
sequences of each chromosome were released earlier. Domesticated rice comprises 
two subspecies: indica and japonica. The reference genome (Nipponbare) is a 
japonica subspecies and contains 37,544 protein-coding genes, among which 2859 
(8%) seemed to be uniquely found in rice. In an effort to explore the genetic diversity 
of cultivated rice, 1483 sequences of both subspecies from 73 countries, sequenced 
at low coverage (1-3 X), were compared to the reference genome. Comparison of 
both subspecies sequences to the reference genome identified 8991 predicted genes 
for the dispensable indica genome and 6366 for the japonica genome. Among these, 
strong evidence of expression or high homology was found for 1120 genes of the 
japonica dispensable genome and 1913 genes of the indica dispensable genome. Out 
of these 1913 high confidence genes, 1189 (62%) contained a recognizable protein 
domain, for a total of 276 different protein domains altogether (Yao et al. 2015). 

In a more recent study, 66 isolates of cultivated rice as well as wild rice (Oryza 
rufipogon)⁵ were sequenced to high coverage and the corresponding genomes were 
de novo assembled and compared (Zhao et al. 2018). Chromosomal introgressions 
from indica were detected in ~16% of tropical japonica genomes. Numerous 
insertions and deletions were identified within genes, since a total of 10,872 genes 
were at least partially absent from the reference genome, due to large indels. Protein- 
coding genes present in at least one isolate were annotated and all transposable 
elements were filtered out. A total of 26,372 genes were found to be common to 
more than 60 rice isolates and were therefore considered to constitute the rice core- 
genome. Variable genes, present in fewer than 60 genomes, were assigned to a 
dispensable set of 16,208 genes, so that the rice pangenome reached a total of 
42,580 genes. A larger proportion of core proteins (78%) than of dispensable proteins 
(36%) matched known domains, suggesting that some of these variable 
genes may be pseudogenes or artifacts. Among dispensable genes, abiotic and biotic 
response genes, controlling disease resistance in rice, were found to be enriched. 
When coding genes were sequentially added from each genome, the number of 
different genes reached a plateau, although more pronounced for gene families than 
for singletons. This strongly suggests that the rice pangenome is almost closed and 
that further sequencing of rice isolates will not prove to be very useful in identifying 
new dispensable genes (Table 1). 


⁵ 28 Oryza sativa japonica, 25 Oryza sativa indica, and 13 Oryza rufipogon isolates. 
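
The rice core-genome described above is defined by a presence threshold (genes found in most, but not necessarily all, isolates). As a minimal sketch of this kind of classification, the hypothetical function below splits genes into core and dispensable sets from presence/absence data. The function name, the toy genes, and the treatment of genes sitting exactly at the threshold are assumptions of this illustration, not details taken from Zhao et al. (2018).

```python
def split_core_dispensable(presence, threshold):
    """Classify genes by the number of genomes that carry them.

    `presence` maps a gene identifier to the set of isolates in which it was found.
    A gene is called core when it occurs in at least `threshold` isolates
    (how the boundary case is handled is an assumption of this sketch).
    """
    core = {gene for gene, isolates in presence.items() if len(isolates) >= threshold}
    dispensable = set(presence) - core
    return core, dispensable

# Hypothetical toy example: five isolates and a core threshold of four.
presence = {
    "geneA": {"i1", "i2", "i3", "i4", "i5"},
    "geneB": {"i1", "i2", "i3", "i4"},
    "geneC": {"i2", "i5"},
    "geneD": {"i3"},
}
core, dispensable = split_core_dispensable(presence, threshold=4)
print(sorted(core), sorted(dispensable))
```

The pangenome is then simply the union of the two sets, which is how the core and dispensable gene counts add up to the pangenome total.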


6.3 Maize Genomes 


Transcriptome sequencing of polyadenylated mRNAs was used in a genome-wide 
study as a proxy to determine the complete set of protein-coding genes within 
503 diverse maize inbred isolates of different origins (Hirsch et al. 2014). 
RNA-seq reads were mapped to the Zea mays reference genome and reads that did 
not match were used for identification of novel transcripts. To limit redundancy, only 
the longest transcript of each locus was taken into consideration for further analysis. 
A total of 8681 high confidence transcripts that were absent from the reference 
genome were categorized as dispensable genes. Among those, 50% matched with 
rice and sorghum proteins, ruling out that they could be artifacts or contaminants. 
Transcripts detected in all isolates, including the reference line, represented 16,393 
genes and constituted the core-genome. Dispensable transcripts, that were identified 
in only a subset of isolates, represented 25,510 genes, for a pangenome of 41,903 
genes, very close to the rice pangenome, although the proportion of variable genes 
was much higher in maize (61% vs. 38% for rice). Sequential addition of genes 
belonging to each isolate revealed that the number of different singletons and gene 
families reached a plateau (more pronounced for singletons), demonstrating that the 
maize pangenome was closed, or very close to completion (Table 1). 
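
The comparison between rice and maize can be reproduced directly from the gene counts quoted in the text. The short sketch below does so; the helper name `variable_fraction` is hypothetical and only the counts come from the chapter.

```python
def variable_fraction(core_genes, variable_genes):
    """Share of the pangenome made of variable (dispensable) genes."""
    return variable_genes / (core_genes + variable_genes)

# Gene counts quoted in the rice and maize sections.
rice = {"core": 26_372, "variable": 16_208}
maize = {"core": 16_393, "variable": 25_510}

for name, counts in (("rice", rice), ("maize", maize)):
    pangenome = counts["core"] + counts["variable"]
    fraction = variable_fraction(counts["core"], counts["variable"])
    print(f"{name}: pangenome = {pangenome} genes, variable fraction = {fraction:.0%}")
```

The two pangenomes differ by fewer than 700 genes, whereas the variable fractions differ by more than 20 percentage points.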


6.4 Cabbage Genomes 


Brassica oleracea is a diploid eudicotyledon, comprising remarkably morphologi- 
cally diverse crops, including cabbage, cauliflower, broccoli, Brussels sprout, kohl- 
rabi, and kale. The B. oleracea pangenome was built by sequencing nine isolates 
(eight cultivated and one wild—Brassica macrocarpa) and anchoring them on one 
of the two reference genomes (Parkin et al. 2014). The assembled pangenome covers 
587 Mb and represents 61,379 genes, after removal of transposable elements. The 
core-genome constitutes the majority of the pangenome, representing 49,895 genes 
(81%), whereas 11,484 genes (19%) are variable, 1322 (2%) being present in only 
one line. Dispensable genes were enriched for functions predicted to be involved in 
disease resistance, defense response, water homeostasis, amino acid phosphoryla- 
tion, and signal transduction. Lineage-specific variable genes comprised biotic and 
abiotic stress response genes, similar to what was observed in rice and soybean. 
B. oleracea underwent a whole-genome triplication specific to this lineage, in which 
gene families involved in auxin function and in morphological variations were 
amplified, the latter perhaps contributing to the wide morphological diversity 
observed in this species. 

There are 14 variable genes predicted to regulate flowering time and maturity in 
B. oleracea, but all of them were absent from one of the two reference strains 
(TO1000), a rapid cycler. One of the flowering loci, FLC (Flowering Locus C), is 
an important regulator of vernalization and regulates flowering time variation by the 
number of gene copies. One FLC gene was present in Arabidopsis thaliana, whereas 
four paralogues were found in B. oleracea. All four were part of the core-genome 
and two additional homologues were detected: one was present in all lines except the 
TO1000 reference strain and the other was present only in B. macrocarpa and one 
isolate (Cauliflower1). Independent functional studies showed that disruption of this 
gene in cauliflower led to early flowering, strongly suggesting that its absence in 
TO1000 was responsible for the early flowering of this rapid cycler (Golicz et al. 
2016). 

Genetic signatures of the core-genome and of the variable genome are very 
different. Core genes are longer on the average and harbor more exons. They also 
have lower mean SNP density and a lower ratio of non-synonymous over synonymous 
substitutions than variable genes, suggesting that core genes are 
under stronger purifying selection than variable genes. In conclusion, 
B. oleracea core and variable genes exhibit the same properties that were observed 
in other eukaryotic pangenomes. 


6.5 Poplar Genomes 


The genome of Populus trichocarpa, black cottonwood, was published in 2006. Out 
of its predicted 45,555 protein-coding genes, 40,307 (88%) had a homologue in 
Arabidopsis thaliana, while conversely 91% of A. thaliana predicted genes showed 
some similarity to a P. trichocarpa gene (Tuskan et al. 2006). More recently, six 
isolates of other poplar species, four Populus nigra and two Populus deltoides, were 
sequenced to 26-45X coverage and compared to the P. trichocarpa reference 
genome. Genome comparisons identified 7889 deletions and 10,586 insertions in 
the two newly sequenced species, as compared to P. trichocarpa. However, a large 
majority of these were due to transposons and retrotransposable elements (62% of 
deletions and 84% of insertions), a feature shared by all plant pangenomes 
sequenced so far. Once transposon sequences were filtered out, 3230 genes 
exhibiting CNV signatures between at least two of the samples were detected. 
These CNVs were significantly more abundant within 3 Mb from telomeres and 
corresponded to gene additions or deletions in one or more sample. A total of 
230 variable genes were detected among P. nigra samples, and 174 dispensable 
genes between the two P. deltoides isolates. The reference P. trichocarpa genome 
showed 187 genic variations with P. nigra and 213 with P. deltoides. Among these 
dispensable genes, 70% belonged to a gene family, allowing the detection of some 
overrepresented gene functions. Remarkably, variable genes were preferentially involved 
in signal transduction, receptor activity, and disease resistance, similarly to what was 
observed for soybean, rice, and cabbage (Pinosio et al. 2016). 

The authors of this study calculated that the poplar pangenome was approxi- 
mately 500 Mb, 80% being shared by all the isolates and therefore constituting the 
core-genome. When P. nigra and P. deltoides genomes were compared to the 
reference P. trichocarpa, 2270 genes were absent from at least one sample and 
2453 other genes were detected in a variable number of copies, for a total of 4723 
variable genes. Unfortunately, the proportion of dispensable genes between P. nigra 
and P. deltoides was not determined, and it was therefore not possible to figure out 
the exact size of the poplar pangenome. However, estimates suggest a size of 
~34,000 genes for the core-genome and ~12,000-13,000 variable genes, giving a 
pangenome size of ~46,000-47,000 genes. Using available data about P. nigra 
dispensable genes, it is tempting to suggest that its pangenome should be closed. 


6.6 Mamiellales Genomes 


Micromonas pusilla is a marine picoeukaryote of the Mamiellales order, measuring 
less than 2 µm and living in all oceans worldwide. Two independent isolates of 
M. pusilla were sequenced and their genomes were compared to those of 
Ostreococcus lucimarinus and Ostreococcus tauri, two other Mamiellales. Surpris- 
ingly, the two Micromonas shared only 90% of their 10,000 predicted genes, 
whereas the two Ostreococcus shared 97% of theirs. Comparison of the four 
sequences allowed the definition of a core-genome containing 7137 genes, including genes involved in 
photosynthesis, genes encoding hydroxyproline-rich glycoproteins (essential components of the plant 
cell wall), and meiosis genes. The latter were unexpected since Mamiellales are generally 
considered to be asexual, suggesting that these genes were remnants of their 
common ancestor with land plants, or alternatively that they possessed a kind of 
sexuality that has not been described yet. This last hypothesis would be compatible 
with the presence of glycoproteins known to be expressed after sexual fusion in 
Chlamydomonas reinhardtii. In addition to core genes, 14% of proteins (1384) were 
shared by both Micromonas isolates but were not found in Ostreococcus. These 
include enzymes for plastid peptidoglycan synthesis. These “shared” genes were 
found to evolve more rapidly than core genes. A large proportion of genes present in 
only one of the two Micromonas isolates exhibited homology to animal or bacterial 
lineages, supporting their acquisition by horizontal transfer. Altogether, 793 and 
826 genes were unique to each of the two Micromonas isolates, 689 were specific to 
O. tauri and 249 were unique to O. lucimarinus. These variable genes, when added to 
the 7137 core genes and to the 2824 genes shared by at least two of the four 
genomes, gave a Mamiellales pangenome size of 12,518 genes (Nordberg et al. 
2014; Worden et al. 2009). 
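
The Mamiellales pangenome size is the sum of mutually exclusive gene categories. The sketch below simply adds up the counts quoted in the text; the dictionary keys are descriptive labels chosen for this illustration.

```python
# Gene counts quoted in the text; the keys are labels chosen for this sketch.
mamiellales = {
    "core (all four genomes)": 7137,
    "shared by at least two genomes (non-core)": 2824,
    "unique to the first Micromonas isolate": 793,
    "unique to the second Micromonas isolate": 826,
    "unique to O. tauri": 689,
    "unique to O. lucimarinus": 249,
}

pangenome_size = sum(mamiellales.values())
core_share = mamiellales["core (all four genomes)"] / pangenome_size
print(f"pangenome: {pangenome_size} genes, core share: {core_share:.0%}")
```

With these figures, the core-genome represents a little more than half of the Mamiellales pangenome.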
7 Animal Pangenomes 


7.1 Drosophila Genomes 


Drosophila melanogaster is one of the most intensively studied animal models. The 
first draft of its genome was published in 2000 (Adams et al. 2000). Its euchromatin 
part covered ~120 Mb and contained 13,600 genes, only twice as many as budding 
yeasts. Following this pioneering work, 11 other fly species originating from Africa, 
Asia, the Americas, and the Pacific islands were sequenced and compared to 
D. melanogaster reference genome. Gene numbers range from 13,733 for 
D. melanogaster⁷ to 17,325 for Drosophila persimilis. Sequence comparisons 
established that 49% of D. melanogaster genes were conserved as single-copy 
orthologues across the whole set of species, defining a set of 6698 core genes. 
Collectively, the 12 Drosophila genomes contain 40,852 variable genes, for a 
pangenome size of 47,550 genes, but unfortunately it was not possible to determine 
if this pangenome was closed or open with published data. However, some interest- 
ing observations were made. First, effector proteins (like antimicrobial peptides) 
evolved by rapid duplications and deletions and were significantly underrepresented 
in the core-genome. Second, gene families forming most of the variable gene content 
expanded or contracted at a rate of one fixed gene gain or loss every 60,000 years. 
Common functions among some of the rapidly evolving families include defense 
response and proteolysis. Third, the vast majority (98%) of Drosophila proteins were 
ancestrally present at the root of the genus. Out of the 296 non-ancestral proteins, 
252 were specific to the Sophophora subgenus or were complex acquisitions. The 
remaining 44 genes were lineage-specific (four of them are found only in 
D. melanogaster), were shorter than the average, harbored fewer introns and 40% 
of them (18/44) were testis-specific, consistent with previous observations about 
new Drosophila genes (Drosophila 12 Genomes Consortium 2007). 

In conclusion, Drosophila core genes represent roughly 40-50% of each species' 
gene pool, and variable genes arise most of the time by duplication or deletion of an 
existing gene, with very little de novo gene creation. 
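
The "roughly 40-50%" figure follows from dividing the core gene count by the gene number of each species. The sketch below does this for the two extremes quoted in the text (the smallest and largest gene counts among the 12 species); the variable names are illustrative.

```python
core_genes = 6698        # single-copy orthologues conserved across the 12 species
variable_genes = 40_852  # variable genes across the 12 genomes

# Smallest and largest gene counts quoted in the text.
gene_counts = {"D. melanogaster": 13_733, "D. persimilis": 17_325}

pangenome = core_genes + variable_genes
core_share = {species: core_genes / n for species, n in gene_counts.items()}

print(f"pangenome: {pangenome} genes")
for species, share in core_share.items():
    print(f"{species}: core genes = {share:.1%} of the gene pool")
```

The core share is close to 49% for D. melanogaster and close to 39% for D. persimilis, which brackets the range given in the conclusion.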


7.2 Avian Genomes 


Birds encompass the richest variety of species among tetrapod vertebrates, with 
more than 10,000 different species. In an international effort, 48 avian species, 
covering most avian clades, were sequenced to low or high coverage and compared 
to the existing three reference genomes (zebra finch, turkey, and chicken), as well as 
to three crocodilian genomes, the closest bird relatives. After filtering for transposable 
elements, each genome was predicted to contain ~14,000-17,000 genes. They 
contained a low level of repeated elements (4-10%) as compared to other tetrapods 
(34-52% in mammals, for example). 


⁷ Gene number was refined since publication of the draft genome sequence. 

Genes responsible for morphological and physiological peculiarities of the clade 
were analyzed more in depth. Flight capacity was permitted through duplication and 
positive selection of genes regulating skeleton morphology and bone development. 
Out of 89 genes involved in ossification, half showed traces of positive 
selection, compared to one-third of the 31 orthologous genes in mammals. 

Feathers are made of α- and β-keratins, the latter only found in birds and reptiles. 
Avian genomes contained fewer α-keratin genes as compared to mammals but the 
repertoire of β-keratins has expanded (up to +150 copies in zebra finch). Similarly, 
most avian genomes contained a higher number of opsin genes than mammalian 
genomes, partly explaining their more advanced visual system. Genomic elements 
that were highly conserved among the 48 bird genomes were identified genome- 
wide. Such elements covered 11 Mb (1% of the avian genome) and were significantly 
underrepresented in coding regions. Actually, the proportion of conserved 
elements was 50-fold higher in noncoding regions, where they mostly corresponded to 
regulatory regions of developmental genes. This result suggested that few avian- 
specific genes arose in this clade, most of the genomic changes resulting from 
differences in developmental regulations (Seki et al. 2017). 

In conclusion, avian genomes are smaller than mammalian genomes, both in size 
and in number of genes, due to extensive deletions of chromosomal segments in the 
ancestral lineage. More precise analyses are now required to sort out core genes from 
dispensable ones in order to be able to define core- and pangenome sizes and 
contents. 


7.3 Human Genomes 


The first human genome drafts were published in 2001 at the same time by the 
Human Genome Sequencing Consortium and by Celera Genomics (International 
Human Genome Sequencing Consortium 2001; Venter et al. 2001), and a more 
complete version of the academic sequence was released in 2004 (International 
Human Genome Sequencing Consortium 2004). A few years later, James Watson's 
own genome was deciphered (Wheeler et al. 2008), rapidly followed by the first 
Asian genome (Wang et al. 2008) and the first African genome (McKernan et al. 
2009). The human pangenome was built from comparisons between the NCBI 
human reference genome and four genomes: Venter's (Celera Genomics), Watson's, 
YH (Asian genome), and NA18507 (African genome), as well as individual human 
sequences retrieved from GenBank. Four types of sequence variants were detected: 
(1) sequences that were frequent in African populations but rapidly declined out of 
Africa; (2) sequences that were rare in African populations but became more 
frequent with geographical distance; (3) sequences that were present at a low 
frequency in European populations; and (4) sequences that were rare in Asian 
populations. This analysis led to the conclusion that the human pangenome should 
include 19-40 Mb of sequence in addition to the reference genome and 
that complete coverage of all gene variants should be achieved with the sequencing 
of 100-150 randomly sampled individuals, worldwide. Analysis of sequences that 
could not map to the reference genome showed that some of the most abundant 
genes were those encoding DUX homeobox proteins (113 hits in YH and 58 in 
NA18507), known to be associated with chromatin. Also very frequent were gene 
families known to be rapidly evolving, such as mucins, zinc-finger proteins, and 
olfactory receptor proteins (Li et al. 2010). 

In conclusion, the present-day human pangenome is still open and will require 
many more finished sequences in order to be resolved. Recent efforts to 
sequence 1070 Japanese genomes (Nagasaki et al. 2015), 2504 individuals from 
26 worldwide origins (The 1000 Genomes Project Consortium 2015) or 10,545 
human genomes representative of the main human populations (Telenti et al. 2016) 
will no doubt allow a more precise definition of the human core- and pangenomes and 
settle this question. 


7.4 Reaching for the Metazoan Pangenome 


With a wealth of more than 300 metazoan genomes sequenced, defining a core- and a 
pangenome for multicellular animals could seem a reachable goal. However, with the 
last common ancestor of all metazoans estimated to have lived around 800 million 
years ago (Erwin et al. 2011), identification of a reliable set of core genes might 
prove challenging. The sponge Amphimedon queenslandica is an early metazoan 
(Fig. 1) whose genome was sequenced in 2010. It is predicted to contain 18,693 
protein-coding genes. Comparison with 4670 metazoan gene families defined a set 
of 1286 proteins that seem to be metazoan specific, thus defining a draft core- 
genome for multicellular animals (Srivastava et al. 2010). Many gene expansions 
observed in the metazoan lineage arose by subsequent tandem or local gene dupli- 
cations, but extensive work is now needed in order to extract this information from 
available metazoan genome sequences. 


8 The Oceanic Pangenome 


The TARA ocean program aims at sampling all planktonic lifeforms of the world's 
ocean (de Vargas et al. 2015). Metatranscriptomes were established from high- 
coverage polyA RNA-Seq performed on 441 size-fractionated planktonic commu- 
nities. Subsequent clustering created a nonredundant set of 116 million transcribed 
sequences, at least 150 bases long. Despite the sampling effort, it was calculated that 
166-190 million sequences would be needed to reach saturation of all oceanic 
eukaryotic expressed sequences. Half of these sequences had no match in public 
databases, suggesting that they may correspond to new genes, but most of these 
(60%) were present as single copies. Transcription of these new genes showed that 
they were expressed to the same level as known families, suggesting that they were 
conserved in a smaller number of species or that they were present in less abundant 
taxonomic groups. Increasing the sampling effort should solve this issue (Carradec 
et al. 2018). These data, although preliminary and not totally exhaustive, demon- 
strated that it was possible to extract thousands of new eukaryotic genes belonging to 
yet uncharacterized species from large oceanic metagenomes. It would be difficult to 
use the same approach for land eukaryotes for which a comprehensive sampling will 
be much more tedious and time consuming. 
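
To put the sampling effort in perspective, the nonredundant set of 116 million sequences can be expressed as a fraction of the 166-190 million sequences estimated to be needed for saturation. The sketch below performs this division using only the figures quoted above; the variable names are illustrative.

```python
observed = 116e6                           # nonredundant transcribed sequences obtained
saturation_low, saturation_high = 166e6, 190e6  # estimated range needed for saturation

# The larger the saturation estimate, the smaller the fraction already sampled.
coverage_min = observed / saturation_high
coverage_max = observed / saturation_low

print(f"sampled fraction of expressed sequences: {coverage_min:.0%} to {coverage_max:.0%}")
```

Under these estimates, between roughly three-fifths and seven-tenths of oceanic eukaryotic expressed sequences would already have been captured.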


8.1 The Haptophyte Alga Emiliania huxleyi 


Marine phytoplankton is responsible for carbon fixation and export to the sea floor as 
calcite, as well as carbon dioxide release during the calcification process. Their 
influence on carbon metabolism and export to the deep ocean is complex and crucial 
for the Earth ecosystem. The haptophyte E. huxleyi CCMP1516 reference genome 
was determined, as well as 13 other isolate genomes, from subarctic to tropical 
oceanic origins (Read et al. 2013). Repetitive elements were extremely abundant, 
representing about two-thirds of the sequence and include retrotransposons (1%), 
DNA transposons (3%), rDNA-related repeats (3%), paralogous genes (10%), tan- 
dem repeats and low complexity regions, especially 10-11 bp tandemly repeated 
minisatellites (34%) and unclassified repeats (16%). These repetitive elements 
account for a large part of the considerable genome size variability, that ranges 
from 99 to 133 Mb between isolates (141.7 Mb for the CCMP1516 reference). The 
reference genome gene content was then compared to three isolates of very distant 
origins.⁸ Out of 30,569 predicted genes in the reference, a total of 5218 (17%) were 
absent from at least one of three isolates and 364 were missing from all three. Further 
comparisons with the other isolates strengthened this conclusion: the core-genome 
contained 20,055 genes, about two-thirds of the reference genes, whereas the 
remaining genes were variable, making the E. huxleyi pangenome a complex gene 
repertoire. Besides repeated elements, the genome encodes many iron-binding pro- 
teins, 80 in the core-genome and 30 as variable genes. Iron is essential for calcifi- 
cation and photosynthesis and these differences probably reflect ecological 
disparities among isolates. In addition, the E. huxleyi pangenome encodes 700 pro- 
teins whose function relies on metal binding: selenium (49 proteins, 20 gene fam- 
ilies), zinc (413 proteins), or copper (65 proteins). Finally, the pangenome contains 
26 genes involved in vitamin metabolism, but is unable to synthesize vitamins B1 
and B12, restricting E. huxleyi to oceanic regions where these are freely available. In 
conclusion, the large pangenome of this haptophyte is probably necessary to accom- 
modate its ubiquitous distribution in oceans and illustrates physiological and mor- 
phological disparities observed among isolates. 


⁸ English Channel, north-eastern Pacific Ocean, and Great Barrier Reef. 
9 Where Do Eukaryotic Variable Genes Come From? 


At the present time, there are six independent origins for novel eukaryotic genes: 
interspecific hybridizations, whole-genome duplications, segmental duplications, 
horizontal gene transfer, single gene duplication, and de novo gene creation (Fig. 5). 


9.1 Interspecific Hybridizations 


The American botanist Edgar Shannon Anderson published in 1949 a book describ- 
ing interspecific hybridizations between flowering plants and genotype combina- 
tions resulting from these crosses (Anderson 1949). Since then, it became widely 
accepted among botanists that such events were frequent among plants, resulting in 
frequent transfers of genes from one species to another. Interspecific hybridizations 
were very common among yeast species too (Morales and Dujon 2012). Modern 
brewing yeast, Saccharomyces pastorianus, is the offspring of two successive 
hybridizations, an ancestral one between Saccharomyces uvarum and an unknown 
species and a more recent one between the resulting hybrid and S. cerevisiae 
(Nguyen et al. 2011). 

Despite these interesting observations, zoologists were stuck with a very conser- 
vative notion of species, based on reproductive isolation, i.e., two species were 
considered as different if the offspring of their mating was sterile. This remarkably 
conservative thinking did not take into consideration that many natural fertile 
interspecific animal hybrids were already described: liger (lion and tiger), pizzlies 
(polar bear and brown bear), Hawaiian duck (mallard/Laysan duck), Heliconius 
butterflies (Heliconius cydno and H. melpomene) and Darwin’s finches, to name 
only a few (Pennisi 2016). However, this very conservative way of thinking hit a 
wall when genome-wide sequencing of ancient human DNA demonstrated that 
modern Homo sapiens were the result of at least two interspecific hybridizations. 
The first one occurred 50,000-80,000 years ago between Homo neanderthalensis 
and ancestral Homo sapiens, after their “out of Africa” journey. This resulted in the 
retention of 1-4% of Neanderthal genes in all modern Homo sapiens genomes, 
except for those of pure African descent (Green et al. 2010). The second hybridization 
occurred between offspring of Homo sapiens and Homo neanderthalensis and 
a new species of ancestral human, the Denisovan man (named from the cave in the 
Siberian Altai mountains in which it was discovered). The hallmark of this hybrid- 
ization can still be seen in present-day Melanesian populations in which 4-6% of 
genes come from this ancestral Denisovan man (Prüfer et al. 2014). More recently, 
the same team discovered the remains of a 13-year-old girl who was the daughter of 
a Neanderthal mother and of a Denisovan father, demonstrating that these two 
ancient human populations also hybridized with each other, around 50,000 years 
ago (Slon et al. 2018). 
Successive hybridizations can be detected as chromosomal introgressions, large 
DNA fragments which may be fixed by natural selection following backcrossing 
between a hybrid and one of its parents (Fig. 5a). One such example in modern 
humans comes from the Tibetan population. Their genome contains a transcription 
factor induced under hypoxic conditions, EPAS1, whose expression correlates with 
hemoglobin levels in low atmospheric oxygen pressure. This gene is located in a 
120 kb chromosomal region containing a large number of SNPs that were very 
common in Tibetan and Denisovan DNA, but found at very low frequencies in Han 
Chinese genomes. This proved that adaptation to high altitude in Tibetan 
populations was due to a large chromosomal introgression inherited from their 
Denisovan ancestry (Huerta-Sanchez et al. 2014). 

At present, it is safe to say that interspecific hybridizations have been a
significant source of gene novelty in eukaryotic genomes, from fungi to animals and
plants. However, if living species may mate with other species living in a close
ecological niche and produce fertile offspring, we should now define species
independently of the outdated reproductive barrier. Indeed, one may ask: what is a
species?


9.2 Whole-Genome Duplications 


Unlike interspecific hybridizations, which bring together two distinct sets of genes,
whole-genome duplications bring together two identical sets of genes (Fig. 5b).
Whole-genome duplications were extremely frequent in every branch of the eukary- 
ote tree, in ascomycetes (Dujon et al. 2004; Kellis et al. 2004; Wolfe and Shields 
1997), in paramecium (Aury et al. 2006), in teleostean fish (Jaillon et al. 2004), plants 
(International Wheat Genome Sequencing Consortium 2014; Jaillon et al. 2007; 
Vision et al. 2000), rotifers (Flot et al. 2013), and vertebrates (Dehal and Boore 
2005), to cite just a few (Fig. 1). These whole-genome duplications were rapidly
followed by extensive gene loss, which restores gene dosage, but some of the
duplicated genes (also called ohnologues) may be maintained for a longer time and
evolve new functions by neo- or subfunctionalization. S. cerevisiae harbors two
copies of cytochrome c resulting from an ancestral whole-genome duplication, one
encoded by the CYC1 gene and the other by CYC7. The latter is expressed when
oxygen levels are so low that cells are in hypoxia, whereas the former is expressed
when oxygen levels are normal, a classic case of subfunctionalization (Downie et al.
1977). An interesting example of neofunctionalization was discovered in an Antarctic
fish, the eelpout Lycodichthys dearborni, whose genome contains two SAS genes,
resulting from an ancient duplication. Both SAS-A and SAS-B genes encode an
enzyme involved in sialic acid biosynthesis. SAS-B was subsequently partially
duplicated, and the resulting paralogue lost four of its six exons, making
a much shorter gene. The resulting protein happened to bind ice crystals more
efficiently than the full-length protein, interfering with crystal growth and behaving as a
good antifreeze protein. Subsequent tandem amplifications of this shorter version of
SAS-B gave the eelpout the ability to resist extreme cold conditions (Deng et al.
2010).

Fig. 5 (continued) up in selecting a chromosomal region from the other parent (Species B) that will
become a permanent introgression. It is possible that other mechanisms besides backcrossing may
generate chromosomal introgressions. (b) Whole-genome duplication will be followed by extensive
gene loss to counteract gene dosage defects. Sub- or neofunctionalization may occur on one of the
two ohnologues. Only one chromosome was represented for the sake of clarity, but all chromosomes
are duplicated in this process. (c) Segmental duplication of a large chromosomal segment
(in brackets) may produce several duplicated genes in a single event. (d) A gene (in orange) may be
transferred from another organism. Horizontal gene transfer may also affect a small number of
genes. (e) A gene is reverse transcribed and the cDNA integrated in the genome. Former introns
are possibly lost in the process if reverse transcription occurs on a spliced transcript. Note that an
allelic transposition is represented but ectopic duplications are frequent

It might prove technically difficult to discriminate between a recent whole- 
genome duplication and an interspecific hybridization between two closely related 
species, without a good reference. It is possible that some chromosomal duplications 
that were thought to arise from whole-genome duplications were actually acquired 
by hybridization. In the near future, the sequencing of more and more eukaryotic
genomes originating from the same clade should eventually settle any question
about the origin of close paralogues.


9.3 Segmental Duplications 


Another frequent source of novelty comes from local or ectopic duplication of a 
chromosomal DNA segment, called segmental duplication (Fig. 5c). Their length
ranges from a few to several hundred kilobases and they have been found in every
eukaryotic species sequenced so far. They are also commonly called copy-number
variants (CNVs), since their copy number may vary from one genome to another,
or structural variants (SVs). Spontaneous segmental duplications
were found in the yeast S. cerevisiae, during experimental evolution of a wild-type
strain (Dunham et al. 2002) or using a gene dosage assay for growth recovery
(Koszul et al. 2004). These chromosomal duplications were sometimes quite
large, covering 41–655 kb. It was subsequently demonstrated that the mechanism
generating segmental duplications was break-induced replication (BIR), a 
replication-based recombination process that could involve homologous sequences 
or microhomologies at the junction of duplicated segments (Payen et al. 2008). 
Segmental duplications were also described in mouse (Bailey et al. 2004), in 
primate genomes (Cheng et al. 2005), as well as in man (Bailey et al. 2002). They are 
known to be associated with several human disorders (Emanuel and Shaikh 2001) 
and most of them were found to have recently emerged in human history (Jiang et al. 
2007). They are undoubtedly a source of gene novelty by successive duplications of
large chromosomal segments, although their impact on gene content diversity has
not been precisely evaluated yet. 


9.4 Horizontal Gene Transfer 


Very common between prokaryotes, horizontal transfer of a gene (or of a small
number of genes) was long documented by only a few examples in eukaryotes, but may be
more widespread than previously thought. Such events have been identified among
Saccharomycetaceae yeasts (Fig. 4). Out of 255 species-specific genes, 11 were 
identified as possible gene transfers from bacterial species, based on sequence 
similarities and reconstructed phylogenies (Rolland et al. 2009). In S. pombe, 
34 genes were identified as good candidates for horizontal transfer from bacteria, 
16 having occurred before radiation of the clade, 9 being specific to S. pombe (Rhind 
et al. 2011). 

Sexuality is a natural obstacle to the propagation of a horizontally acquired gene 
to metazoan offspring since it must become integrated in the germ line. Nonetheless, 
some remarkable examples of gene transfer from bacteria or yeast to animal
genomes have been described. Wolbachia pipientis is a symbiotic bacterium living
inside several arthropods and some nematodes. Its genome sequence led to the
discovery that 44 out of 45 Wolbachia genes were indeed integrated in the genome
of the tropical fruit fly Drosophila ananassae, one of the natural hosts of this
bacterium. Among the other species subsequently screened for the presence of
Wolbachia genes, one nematode, one mosquito, one tick, three wasps, and five
Drosophila species contained DNA fragments of various lengths originating from
the bacterium (Dunning Hotopp et al. 2007).

Another striking example is the horizontal transfer of yeast genes to pea aphid 
(Acyrthosiphon pisum). This insect displays a red-green color polymorphism that 
serves to escape its natural predators. The different colors are due to different forms 
of carotenoid pigments found in individuals. Animals require carotenoids for several 
essential functions but they are unable to make them. Therefore, they normally find 
them in their diet. Remarkably, seven carotenoid synthases and carotenoid 
desaturases, enzymes required for pigment biosynthesis, are encoded by the aphid 
genome. Comparisons with existing sequences showed that these genes cluster with
orthologues from fungal species, and subsequent experiments led to the conclusion
that these genes were transferred from a fungal pathogen or aphid symbiont, at the
root of the aphid clade, followed by subsequent duplications of the transferred
gene(s) (Moran and Jarvik 2010).

One last example comes from bdelloid rotifers, near-microscopic animals found 
in freshwater habitats worldwide. They lost sexual reproduction due to a specific 
chromosomal organization incompatible with meiotic recombination (Flot et al. 
2013). Telomeric regions of Adineta vaga, a bdelloid rotifer whose complete
genome has been sequenced, revealed dozens of genes of foreign origin. These
were found in large telomeric chromosomal segments covering tens of thousands
of nucleotides and encoding various proteins playing a role in sugar or amino acid
metabolism, in intracellular oxidoreduction, or in the synthesis of antibiotics and
toxins. Most of these genes came from bacterial or fungal species, and some of them
may have been transferred from plants. Among genes that were identified as of bacterial
origin, some harbored introns, whereas their bacterial counterparts did not, suggesting
that introns were acquired after transfer from bacteria. Telomeric regions being also
enriched in transposable elements, the role of transposons in these massive gene 
transfers is still an open question (Gladyshev et al. 2008). 


9.5 Single-Gene Duplication 


Single-gene duplications may occur as allelic or ectopic genome insertions. When
occurring in allelic position, they lead to tandem repeats of paralogous genes, which
are found in variable numbers in eukaryotic genomes. In ascomycetous yeasts, a
few dozen tandem gene arrays were detected in each species, mostly composed of
two to three copies. However, the Debaryomyces hansenii genome contained no fewer
than 247 arrays of tandem paralogues, distributed all over its genome, some of them
counting eight or nine tandemly repeated copies (Dujon et al. 2004). Ectopic
paralogous gene duplications were also very frequent events in eukaryotes. Most 
carry the hallmark of retrotransposition: lack of introns, presence of a 3'-end polyA 
tract and remnants of target site duplications. These retrogenes were also called 
retroposons (Brosius 1991) and the transposition mechanism was studied in 
S. cerevisiae (Schacherer et al. 2004) as well as in human cells (Esnault et al. 
2000). It relies on the reverse transcription of a mature mRNA by a reverse 
transcriptase (encoded by L1 elements in human cells), followed by integration of 
the cDNA at an ectopic or allelic locus (Fig. 5e). These duplicated genes lack 
promoter sequences that were absent from the mature transcript and are therefore 
pseudogenes, unless they luckily transpose near an active promoter. The human 
genome contains approximately 10,000 retrogenes, including more than 1700 ribo- 
somal pseudogenes, while the mouse genome contains more than 200 copies of 
glyceraldehyde-3-phosphate dehydrogenase and Caenorhabditis elegans genome 
harbors more than 2000 pseudogenes (reviewed in Richard et al. 2008). 

Extensive retroposition was also frequently detected in plants, the rice genome
containing 1235 retrogenes. Interestingly, only 337 (27%) were identified as
pseudogenes containing premature stop codons or frameshifts. Subsequent experiments
concluded that more than half of the remaining retroposons were probably
functional genes. In addition, 380 out of 898 intact retrogenes harbor a chimeric
structure containing a flanking exonic sequence (Wang et al. 2006). Therefore,
in contrast to the human genome, in which most retroposons are pseudogenes,
retroposition in the rice genome seems to be an active process rapidly creating
new functional genes.


9.6 De Novo Gene Creation


Some remarkable cases of de novo gene invention have been well documented, 
although the total number of such cases having occurred during evolution of 
eukaryotes is probably underestimated. Alu retrotransposons are very common in
primate genomes, being found in more than 1,000,000 copies, covering ≈13% of the
genome size and present in almost every protein-coding gene intron (International
Human Genome Sequencing Consortium 2001). In dozens of reported cases, an Alu 
sequence was found to be spliced with an upstream exon, resulting in a chimeric 
peptide (Makalowski et al. 1994). These hybrid proteins are a source of genetic 
novelty, although their total number in the human genome has not been precisely 
determined yet. 

Before eukaryotes, the living world was asexual, except for bacterial conjugation 
that may be considered as a very primitive form of mating. Differentiation between 
two sexes appeared with the first eukaryotic cells and was found almost universally 
in the eukaryotic world, suggesting that it must be an ancestral acquisition. Sexual 
reproduction starts with the fusion between two haploid gametes of opposite sex, one
male and one female, called syngamy, followed by the merging of both genetic
contents. It was recently discovered that the protein responsible for syngamy (called
HAP2) was structurally and functionally related to a viral membrane fusion protein. 
HAP2 was conserved in plants and animals and must have been transferred from a 
virus to a common ancestor at the root of the eukaryotic lineage (Fédry et al. 2017). 

Therian mammals include marsupials and placentals (or eutherians), like mouse or
man (Fig. 1). In eutherians, egg development takes place entirely within the uterus and
the placenta is larger and more elaborate than in marsupials. In humans, two genes
are responsible for placenta growth, syncytin-1 and syncytin-2. These genes both
derive from an envelope protein gene captured from an ancestral virus 25–40
million years ago. Remarkably, the mouse genome harbors two homologues,
syncytin-A and syncytin-B, also deriving from a viral infection in the murine lineage
around 20 million years ago, but they are not orthologous to their human counterparts,
showing that the placenta was independently invented twice in two mammalian
lineages by a similar mechanism of viral gene capture (Dupressoir et al. 2009).

In D. melanogaster, the Sdic gene coding for a sperm-specific dynein chain was 
the result of a local duplication and a complex rearrangement between two genes: 
Cdic and AnnX. The resulting Sdic gene was transcribed from a neo-promoter 
located in an intronic sequence and the first 21 amino-acids of the resulting protein 
came from this same intron, now spliced as the first exon of the Sdic mRNA 
(Nurminsky et al. 1998). 

One may argue that the above examples are not real de novo gene creations, since 
they rely on preexisting DNA sequences (Alu elements, viral genes, or serendipitous 
rearrangements of existing exons). It is remarkable that the genome of the excavate
Naegleria gruberi (Fig. 1) contained 40% of genes without any obvious similarity to
any bacterial gene, suggesting that they could be real de novo eukaryotic inventions
(Fritz-Laylin et al. 2010). However, it is possible that many genes that appear to be
novel have in fact diverged so much from their prokaryotic ancestor that they cannot
be identified anymore. Hence, the hunt for real de novo gene creation promises to be
exciting but seriously challenging! 


10 Bioinformatics Tools to Calculate Core- and Pangenomes


Most pangenome analyses have so far been performed on prokaryotic genomes.
Computing tools rely on the initial determination of genes belonging to the core-genome,
followed by addition of all variable genes to build the species pangenome. The initial
step is crucial, since one wants to identify the exhaustive list of orthologues
present in each of the species isolates. Orthologue identification generally uses
bidirectional best hits (BDBH), or BLAST followed by a clustering algorithm such
as MCL, or comparison of protein domains using Hidden Markov Models (HMM)
(reviewed in Guimaraes et al. 2015). In a slightly different approach, PanOCT used
synteny information in addition to orthology to define the core-genome. The program
used “conserved gene neighborhood” information to discriminate real
orthologues from very recently duplicated paralogues whose sequences are
indistinguishable (Fouts et al. 2012).
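
As a minimal sketch of the bidirectional best hit idea, the following Python snippet assumes that pairwise similarity scores (for example BLAST bit scores) have already been computed and are supplied as plain dictionaries. The gene names, the scores, and the helper names best_hits and bidirectional_best_hits are hypothetical and are not taken from any of the tools cited above.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score.
    On ties, the first pair encountered is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )


# Hypothetical scores between genes of isolate A (a1..a3) and isolate B (b1..b3).
a_vs_b = {("a1", "b1"): 950.0, ("a1", "b2"): 310.0,
          ("a2", "b2"): 720.0, ("a3", "b2"): 400.0}
b_vs_a = {("b1", "a1"): 948.0, ("b2", "a2"): 715.0,
          ("b2", "a3"): 402.0, ("b3", "a3"): 120.0}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy example a3 has b2 as its best hit, but b2 prefers a2, so a3 is left without an orthologue. Recently duplicated paralogues can produce exactly this kind of ambiguity, which is where additional evidence such as gene neighborhood becomes useful.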

Calculation of eukaryotic core- and pangenomes is significantly more complex
for several reasons: (1) the abundance of transposable elements, including novel
undescribed transposons absent from dedicated databases; (2) the fragmented (split)
nature of genes, particularly in young eukaryotes; (3) the presence of large gene families
that make orthologue identification tedious; and (4) the relative incompleteness of
genomic sequences, particularly of those containing numerous repeats. In an original
approach trying to tackle these problems, genomic and transcriptomic data from
19 A. thaliana isolates were analyzed using the GET_HOMOLOGUES-EST software,
designed to use tissue-specific expression patterns to build core- and
pangenomes. Results support a set of 26,373 core genes and 11,416 variable
genes, for a pangenome containing a total of 37,789 genes. The pangenome is open,
each new isolate adding approximately 70 novel variable genes. Core genes exhibit a
higher expression level than variable genes and they are under stronger selective
pressure (dN/dS ≪ 1), confirming what was already observed in other eukaryotes.
The same software was used to analyze transcriptomic data from 16 Hordeum
vulgare isolates (barley), a monocotyledon plant. The barley genome is 34 times
larger than that of A. thaliana (4 Gb vs. 119 Mb) and contains 75% repetitive elements.
Its core-genome contains 10,922 genes whereas 28,762 genes were found to be
expressed in the leaf transcriptome. Nine isolates were sufficient to sample 99% of
the pangenome and its size did not increase with subsequent isolates, indicating that it
is closed (Table 1). As in A. thaliana, core genes were more expressed and more
constrained than variable genes (Contreras-Moreira et al. 2017). Merging tissue-specific
transcriptomic and whole-genome sequencing data promises to become a
powerful approach for future core- and pangenome determinations in metazoans and
plants.
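
The distinction between core genes, variable genes, and an open or closed pangenome can be illustrated with a toy presence/absence table. The sketch below is only an illustration under stated assumptions: the isolate names and gene sets are invented, and the functions core_and_pan and accumulation are hypothetical helpers, not part of the software cited above.

```python
def core_and_pan(isolates):
    """Return (core, pan) gene sets for a dict of isolate -> set of genes."""
    gene_sets = list(isolates.values())
    core = set.intersection(*gene_sets)   # genes present in every isolate
    pan = set.union(*gene_sets)           # genes present in at least one isolate
    return core, pan


def accumulation(isolates, order):
    """Pangenome size after adding isolates one by one in the given order."""
    seen, sizes = set(), []
    for name in order:
        seen |= isolates[name]
        sizes.append(len(seen))
    return sizes


# Invented toy data: four isolates and the gene families detected in each.
isolates = {
    "iso1": {"g1", "g2", "g3", "g4"},
    "iso2": {"g1", "g2", "g3", "g5"},
    "iso3": {"g1", "g2", "g4", "g6"},
    "iso4": {"g1", "g2", "g3", "g6"},
}

core, pan = core_and_pan(isolates)
variable = pan - core
sizes = accumulation(isolates, ["iso1", "iso2", "iso3", "iso4"])
# Number of previously unseen genes contributed by each successive isolate.
new_genes = [sizes[0]] + [b - a for a, b in zip(sizes, sizes[1:])]

print(sorted(core), len(variable), len(pan))
print(sizes, new_genes)
```

The curve depends on the order in which isolates are added, so in practice it is usually averaged over many random orderings. Increments that fall to zero, as reported for barley, suggest a closed pangenome, whereas a steady gain per isolate (about 70 genes in A. thaliana) suggests an open one.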


11 The Eukaryotic Pangenome 


As François Jacob put it more than 40 years ago, gene evolution mainly deals with
tinkering, molecular tinkering (Jacob 1977). Young eukaryotes (angiosperms, mammals)
reshuffled gene exons and protein domains that already existed in old eukaryotes
(fungi, excavata, monocellular animals, and algae), more than one billion years
ago. There were very few real inventions after the first eukaryotes, some of which
were mentioned above. An Alu element or a piece of a virus genome may be captured
to make a new protein domain; transposons moved around, sometimes taking along a
piece of DNA that would eventually become an exonic sequence; accumulation of
mutations in a duplicated gene copy could ultimately create a new function by sub-
or neofunctionalization. The redundant nature of eukaryotic genomes, particularly
young ones, is only apparent. Eukaryotic core genes are hidden behind legions of
transposons, successive whole-genome duplications and interspecific hybridizations,
but one may ask how many genes are part of the eukaryotic core-genome.
When trying to define it, exons or protein domains, rather than genes, should
probably be considered as relevant genetic units, to circumvent issues due to
molecular tinkering. Further definition of a eukaryotic pangenome will prove to
be a long and complex task, but the accumulation of high-quality genome sequences
and the exponential increase of computing power might prove it to be a reachable
goal in the forthcoming years.

In 2016, a German team tried to reconstitute the prokaryotic core-genome, using 
sequences from 1847 eubacteria and 134 archaebacteria species, covering 6.1 
million protein-coding genes belonging to 286,000 families. They identified 355 pro- 
teins common to all species, that may be considered as the prokaryotic core-genome 
(Weiss et al. 2016, 2018). But one may ask whether this minimal set of core genes is 
sufficient to support life. In an attempt to create a hypothetical minimal genome, the
J. Craig Venter Institute applied synthetic genomics approaches to Mycoplasma
mycoides. Using a combination of existing deletion data and literature mining,
eight independent segments covering altogether the whole M. mycoides genome
were synthesized. Each of these eight segments was individually reintroduced into
bacteria, but only one of them produced a viable genome. Using high-throughput
transposon mutagenesis, the team subsequently identified a set of 229 genes whose
disruption caused different levels of growth impairment. The eight DNA segments were
rebuilt including these genes. Although each of the individual segments was able to
produce a viable genome, combining the eight segments in the same bacterium was
lethal. Once the team eventually solved this synthetic lethality issue and succeeded
in synthesizing a fully functional minimal genome, they discovered that the biological
function of 146 genes (out of 473 encoded) could not be assigned. These genes
of unknown function were all needed to sustain M. mycoides life (Hutchison et al.
2016). This interesting work supports the conclusion that designing a minimal 
genome based on a core set of genes common to several isolates or to several species 
might not be sufficient to support life. Therefore, defining pangenome contents 
might prove essential to rewrite the genomes of more complex organisms, like 
eukaryotes. 

As one last word, it must be noted that the core- and pangenomes described here took
into consideration only protein-coding genes. It is noteworthy that eukaryotes
contain many more genes encoding various RNA species: tRNA, rRNA, snoRNA,
scRNA, microRNA, and siRNA. Building the whole repertoire of such genes will be
challenging but essential to define, at last, a complete eukaryotic pangenome.

Abstract Over the last few years, pangenome analyses have been applied to eukaryotes,
especially to important crops. A handful of eukaryotic pangenome studies have
demonstrated widespread variation in gene presence/absence among plant species
and its implications for agronomically important traits. In this chapter, we focus on the
methodology of pangenome analysis, which can generally be classified into two 
different types of approaches, a homolog-based strategy and a “map-to-pan” strategy. 
In a homolog-based strategy, the genomes of individuals are independently assem- 
bled, and the presence/absence of a gene family is determined by clustering protein 
sequences into homologs. Alternatively, in a “map-to-pan” strategy, pangenome 
sequences are constructed by combining a well-annotated reference genome with 
newly identified non-reference representative sequences, from which the presence/ 
absence of a gene is then determined based on read coverage after individual reads are 
mapped to the pangenome. We highlight the advantages and limitations of the 
homolog-based strategy and several variant approaches to the “map-to-pan” strategy. 
We conclude that the “map-to-pan” strategy is highly recommended for eukaryotic 
pangenome analysis. However, programs and parameters for pangenome analysis 
need to be carefully selected for eukaryotes with different genome sizes.

In 2005, Tettelin et al. introduced the concept of a pangenome, namely the entire
gene set of a species, in their study of eight strains of Streptococcus agalactiae, which
causes neonatal infection in humans (Tettelin et al. 2005). The pangenome
comprises a “core-genome” that contains genes shared by all individuals of the
species, and a “dispensable genome” containing genes present in some but not all 
individuals of the species. The core-genome is generally believed to be responsible 
for functions essential to the species, such as growth and development, whereas the 
dispensable genome confers functions related to environmental adaptations 
(Vernikos et al. 2015). During the past 10 years, pangenome studies have been 
widely applied to bacteria and other microorganisms. However, only a handful of 
pangenome analyses of higher eukaryotes have been reported (Wang et al. 2018; Hu 
et al. 2018; Sun et al. 2017; Zhao et al. 2018; Ou et al. 2018; Darracq et al. 2018; 
Montenegro et al. 2017; Pinosio et al. 2016; Golicz et al. 2016; Nguyen et al. 2015; 
Lu et al. 2015; Yao et al. 2015; Hirsch et al. 2014; Read et al. 2013; Li et al. 2010, 
2014). In this chapter, we will first review the biological insights highlighted from 
these studies. Then, we will introduce current challenges and strategies for 
performing eukaryotic pangenome analysis, and finally, we will discuss future 
directions in this field. 

Next-generation sequencing (NGS) technologies have enabled whole-genome 
sequencing and comparisons of multiple individual genomes within a species. Single
nucleotide polymorphisms (SNPs), small insertions and deletions (InDels), and structural
variations (SVs), including copy number variations (CNVs) and presence/absence
variations (PAVs), can be identified when comparing against a reference genome. A
considerable number of SVs have been observed among human (Sudmant et al.
2015; Genomes Project et al. 2015; Feuk et al. 2006) and animal genomes (Bickhart
and Liu 2014). For example, a typical human genome contains 2100–2500 structural
variants (including ~1000 large deletions), affecting ~20 Mb of sequence when
compared with a reference genome (~3 Gb) (Genomes Project et al. 2015). In
contrast, SVs have been reported to be more pervasive within plant genomes 
(Saxena et al. 2014), such as rice (Wang et al. 2018; Hu et al. 2018), arabidopsis 
(Cao et al. 2011), maize (Swanson-Wagner et al. 2010), sorghum (Zheng et al. 
2011), and potato (Potato Genome Sequencing C et al. 2011). For example, the total
sequence affected by SVs that differentiate two typical rice accessions is, on average,
about 22–70 Mb (out of ~380 Mb) (Wang et al. 2018). These results imply that
there might be widespread presence of gene PAVs associated with SV sequences. 
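
To make the comparison concrete, the figures quoted above can be converted into the fraction of the genome affected by SVs. The short calculation below uses only the numbers given in the text (~20 of ~3000 Mb for human; 22 to 70 of ~380 for rice, taken here to be megabases); the function name is an arbitrary choice made for this sketch.

```python
def percent_affected(sv_mb, genome_mb):
    """Percentage of the genome covered by structural variants."""
    return 100.0 * sv_mb / genome_mb


human = percent_affected(20, 3000)      # ~20 Mb out of ~3 Gb
rice_low = percent_affected(22, 380)    # lower bound for two rice accessions
rice_high = percent_affected(70, 380)   # upper bound for two rice accessions

print(f"human: {human:.2f}%")
print(f"rice:  {rice_low:.1f}% to {rice_high:.1f}%")
```

On these figures, less than 1% of a human genome but roughly 6 to 18% of a rice genome is affected, about an order of magnitude more, although the human value compares an individual with the reference whereas the rice value compares two accessions.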

Pangenome analyses aim to study gene PAVs, providing a new functional 
interpretation of within-species variations. Compared to SV studies, pangenome 
analyses identify undiscovered genomic sequences and their associated genes and 
reveal the species core and dispensable genome. Early pangenome studies focused 
on comparisons among a small number (2-3) of well-assembled individual genomes
(Liu et al. 2007; Ma and Bennetzen 2004). These studies revealed the space of
undiscovered genes and demonstrated widespread gene PAVs within a species. For 
instance, Li et al. assembled an Asian and an African genome, leading to the 
detection of 5 Mb sequences and hundreds of undiscovered genes that are absent 
in the human reference genome. Liu et al. sequenced ten thousand cDNAs of 93-11,
a Xian (indica) rice accession, and found that >1000 genes were absent in the Geng
(japonica) reference genome (Liu et al. 2007), which was believed to have diverged
from Xian ~0.44 million years ago (Ma and Bennetzen 2004); later, Schatz et al. 
compared three assembled genomes of a Xian (IR64), a Geng, and an aus (DJ123) 
accession, and found that ~3000 genes were absent in at least one accession. 

However, studying a small number of individuals cannot reveal the global 
landscape of gene PAVs of a species and cannot confidently identify the species 
core and dispensable genomes. Thus, systematic studies involving a large number of
representative individuals within a species are highly desirable. Large-scale plant
pangenome studies involving tens to hundreds of individuals have emerged over
recent years (Table 1). Many of these studies revealed that gene PAVs are a very
important aspect of the genomic diversity within eukaryotic species/populations, one that
can provide significant insights into the evolutionary history of the species/populations,
with significant implications for functional genomic research on important traits.

In Emiliania huxleyi, a marine phytoplankton important for carbon fixation in 
ecosystems, one-third of the genes in the reference genome are absent in the 
13 sequenced individuals (Read et al. 2013). The core-genome controls inorganic 
nitrogen uptake/assimilation and nitrogen-rich compound acquisition/degradation, 
while the dispensable genome is in charge of metabolic repertoires, of which over 
one-fourth involve iron-binding activities and vitamin B1 and B12 synthesis (Read 
et al. 2013). 

In rice, several studies consistently report that about ten thousand genes are 
missing in the widely used Nipponbare reference genome (Wang et al. 2018; Zhao 
et al. 2018; Yao et al. 2015), and almost all of them can be detected in wild rice (Wang 
et al. 2018). The dispensable genome accounts for >38% of the species pangenome 
and over one-fourth of a typical individual genome (Wang et al. 2018). On average, 
two Xian or Geng genomes differ by about 4000 (~10%) genes, respectively, whereas 
a Xian genome and a Geng genome differ by more than 6000 (~15%) genes (Wang 
et al. 2018). Although the dispensable genome is less studied, it appeared to harbor 
functions related to environmental adaptations, such as regulation of immune/defense 
responses and ethylene metabolism (Wang et al. 2018). Interestingly, the well-known
Green Revolution gene, sd-1, coding for a key enzyme, GA20-oxidase, in the biosynthesis
of the important plant hormones GA1/GA4, is a dispensable gene that is associated
with many important processes in plant growth, development, and responses to
abiotic stresses (Wang et al. 2018; Zhao et al. 2018).
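
The pairwise differences quoted above amount to counting genes that are present in one accession but absent in the other. A minimal sketch of that count follows, assuming each accession is represented simply as a set of present gene identifiers. The identifiers and accession names are made up for illustration, and counting the symmetric difference is an assumption about how such figures are derived, not a description of the cited pipelines.

```python
from itertools import combinations


def pav_difference(genes_a, genes_b):
    """Number of genes present in exactly one of the two accessions."""
    return len(genes_a ^ genes_b)   # symmetric difference


# Made-up presence calls for three accessions.
accessions = {
    "xian_1": {"g1", "g2", "g3", "g4", "g5"},
    "xian_2": {"g1", "g2", "g3", "g4", "g6"},
    "geng_1": {"g1", "g2", "g7", "g8", "g9"},
}

differences = {
    (a, b): pav_difference(accessions[a], accessions[b])
    for a, b in combinations(sorted(accessions), 2)
}
for pair, n in differences.items():
    print(pair, n)
```

In this toy data the two Xian accessions differ by fewer genes than either does from the Geng accession, mirroring the within-group versus between-group pattern described for rice.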

In Brassica oleracea (Golicz et al. 2016), wild soybean (Li et al. 2014), and
bread wheat (Montenegro et al. 2017), it was reported that the dispensable genomes take
up 18.7%, 20%, and 35.7% of the pangenomes, respectively. Although the pipelines
and parameters/thresholds used to determine gene presence differed a lot in the 
above studies, it is well demonstrated that plants exhibit considerably large

Table 1 Representative pangenome studies
Species | Haploid genome size (bps)* | N | References | Strategy | Comment
Homo sapiens (human) | 2991 M | 3 | Li et al. (2010) | Directly comparing two de novo assembled individual human genomes (an Asian and an African) with the human reference genome. | 19-40 Mb sequences containing 7150 genes cannot be found in the reference.
Emiliania huxleyi (coccolithophore) | 168 M | 14 | Read et al. (2013) | Building a reference genome from an individual genome; assembling 3 additional individual genomes and comparing them with the reference genome; determining presence/absence of reference genes by mapping short reads of additional 10 individuals to the reference. | 7.1300 reference genes are not present in the 3 individual genomes; the core-genome accounts for 2/3 of the reference genes.
Zea mays (maize) | 2135 M | 503 | Hirsch et al. (2014) | Sequencing the transcriptome of 503 accessions. Assembling genes from transcriptome sequencing. A gene with FPKM > 0 is considered as present. | Identifying ~8600 representative transcript assemblies (RTAs) absent in the B73 reference; 16.4% RTAs express in all lines and 82.7% express in subsets of the lines.
Glycine soja (wild soybean) | 924 M | 7 | Li et al. (2014) | Sequencing and de novo assembling 7 individuals’ genomes. Clustering annotated genes to gene families. | Dispensable genome accounts for 20% of the pangenome, and displays greater sequence variation than the core-genome.
Oryza sativa (rice) | 374 M | 1483 | Yao et al. (2015) | Aligning low-depth (1~3x) reads to a pangenome; building the dispensable genome by assembling the pool of unaligned reads from each individual. Indica and japonica accessions are separately studied. | Detecting ~9000 genes for the indica dispensable genome and 76000 genes for japonica dispensable genome.
Brassica oleracea | 514 M | | Golicz et al. (2016) | Using a reference-based iterative strategy to assemble the pangenome: (1) mapping reads to the reference sequence; (2) assembling unmapped reads; and (3) updating the reference. Determining gene PAV by mapping short reads to the pangenome. | Dispensable genome accounts for 18.7% of the pangenome.
Triticum aestivum (bread wheat) | 13,672 M | 18 | Montenegro et al. (2017) | Building a reference genome; constructing the pangenome sequences by combining the reference genome and non-reference sequences, which are assembled from the pool of unaligned reads from each individual. Determining gene presence/absence by mapping short reads to the pangenome. | Dispensable genome accounts for 35.7% of the pangenome.
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Table 1 (continued) 
Haploid 
genome 
size 
Species (bps)* N References Strategy Comment 
Oryza sativa 374M 453 | Wang etal. | Assembling 3010 | Discover 283 M 
(rice) (2018), Hu individual non-reference 
et al. (2018), | genomes indepen- | sequences with 
Sun et al. dently; building >10,000 genes; 
(2017) representative Dispensable 
non-reference genome accounts 
sequences by for 35.7% of the 
removing the pangenome. 
redundant Dispensable 
sequences from the | genes tend to be 
pool of contigs that | younger, shorter, 
are unaligned to exhibiting higher 
the reference. level of SNPs. 
Constructing a 
pangenome by 
combining the 
reference genome 
and representative 
non-reference 
sequences. 
Determining gene 
presence/absence 
for 453 individuals 
with sequencing 
depth >20 by 
mapping short 
reads to the 
pangenome. 
Capsicum 3095 M 383 | Ou et al. Using the same Discover 956 M 
(including 4 spe- (2018) strategy as the non-reference 
cies) (pepper) above rice study. sequences with 
250,000 genes; 
55.796 of the 
pangenome show 
>50% presence 
frequencies in all 
the 4 species. 
Oryza sativa and | 374 M 66 |Zhao et al. Sequencing and de | Discover 
Oryza rufipogon (2018) novo assembling 710,000 
(rice and wild 66 individual non-reference 
rice) genomes. Cluster- | genes; 62% of the 
ing annotated pangenome can 
genes to gene be found in 760 
families. individuals. 


"The genome size was obtained from NCBI genome database. It can be the size of a reference 
genome or the average size of several independent assemblies 
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dispensable genomes, harboring functions related to many agronomically important 
traits. Moreover, several studies consistently demonstrate that dispensable genes 
tend to be younger (Wang et al. 2018; Chen et al. 2012; Bush et al. 2013), shorter 
(Wang et al. 2018; Bush et al. 2013; Schatz et al. 2014), have less exons (Wang et al. 
2018; Bush et al. 2013; Schatz et al. 2014), harbor a much higher level of sequence 
variations (Wang et al. 2018; Li et al. 2014), and have fewer paralogs (Wang et al. 
2018; Bush et al. 2013). 


1 Eukaryotic Pangenome Analysis Strategy 


Because the pangenome is a property of a species/population, any pangenome study
should carefully consider its sampling strategy so that the maximum number of gene
PAVs can be detected with a minimum number of samples. According to the core
collection concept in plant genetic resources (Frankel and Brown 1984), a core
collection consisting of a limited number of well-sampled accessions of a plant species
would represent the whole spectrum of its within-species diversity. In practice, a
well-established semi-stratified sampling strategy that considers both the center(s) of
diversity/origin and the geographic distribution of a plant species has demonstrated
that a core collection containing only 5% of the total collected accessions of a species
would cover ~95% of the total species diversity (Jia et al. 2017). This concept should
be equally applicable to pangenome research on animal species.
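
To make the sampling goal concrete, the sketch below greedily picks accessions so that
each new pick adds the largest number of genes not yet seen. It only illustrates the
idea of detecting the maximum number of gene PAVs with a minimum number of samples; it
is not the semi-stratified procedure of Jia et al. (2017). The function name
`greedy_core_collection`, the accession names, and the gene sets are hypothetical.

```python
def greedy_core_collection(gene_sets, target_fraction=0.95):
    """Pick accessions greedily until target_fraction of all genes is covered.

    gene_sets: dict mapping accession name -> set of gene identifiers
               (hypothetical input; a real study would use marker or PAV data).
    Returns the chosen accessions in selection order.
    """
    all_genes = set().union(*gene_sets.values())
    needed = target_fraction * len(all_genes)
    covered, chosen = set(), []
    remaining = dict(gene_sets)
    while len(covered) < needed and remaining:
        # Accession contributing the most genes not yet covered
        # (names are sorted first so that ties resolve deterministically).
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no accession adds anything new
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen


toy = {
    "acc1": {"g1", "g2", "g3"},
    "acc2": {"g1", "g2"},
    "acc3": {"g3", "g4", "g5", "g6"},
    "acc4": {"g1", "g6"},
}
print(greedy_core_collection(toy, target_fraction=0.95))  # ['acc3', 'acc1']
```

In this toy case two of the four accessions already cover all six genes. A greedy pick
can only be made once diversity data exist, so in practice it would complement, not
replace, stratification by origin and geography.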

For the analytic methodology, almost all bacterial pangenome analyses follow a
homolog-based strategy (Fig. 1) involving (1) de novo assembly of individual
genomes; (2) independent annotation of protein-coding genes in each assembled
genome; and (3) pooling all protein sequences together and clustering them into
homologs (gene families) or orthologs using protein clustering tools (Steinegger and
Söding 2018; Fu et al. 2012) or ortholog grouping tools (Emms and Kelly 2015; Li
et al. 2003). Gene family presence/absence in each individual can be retrieved from
the clustering results. This strategy is highly dependent on the completeness of the
whole-genome assembly: failure to assemble a sequence segment will cause all genes
located on that segment to be called absent. Moreover, the protein similarity threshold
used for gene family determination may affect the size, and even the relative size,
of the core-genome and pangenome.

Fig. 1 Homolog-based strategy for pangenome analyses. This strategy is widely used for bacterial
pangenome analyses. It includes the following steps: (1) independent assemblies of individual
whole genomes; (2) annotation of protein-coding genes for each genome; and (3) clustering genes
into homologs (gene families) to determine the presence/absence of each gene family
[Figure: for Individuals A, B, and C, raw reads become contigs (de novo assembly), then
protein-coding genes (gene annotation), then core and dispensable gene families
(homology clustering)]
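
The bookkeeping behind step (3) can be sketched as follows: once every annotated gene
has been assigned to a family, family presence/absence per individual follows directly,
and core versus dispensable families can be separated. This is a minimal sketch, not
the output format of any particular clustering tool; the tuple layout
`(individual, gene_id, family_id)` and the function names are assumptions made for
illustration.

```python
from collections import defaultdict


def family_pav(assignments, individuals):
    """Build a gene-family presence/absence table.

    assignments: iterable of (individual, gene_id, family_id) tuples, a
                 hypothetical flattened form of a clustering result.
    individuals: list of all individuals in the study.
    Returns {family_id: {individual: True/False}}.
    """
    members = defaultdict(set)
    for individual, _gene, family in assignments:
        members[family].add(individual)
    return {
        family: {ind: ind in present for ind in individuals}
        for family, present in members.items()
    }


def split_core_dispensable(pav):
    """Core families are present in every individual; the rest are dispensable."""
    core = sorted(f for f, row in pav.items() if all(row.values()))
    dispensable = sorted(f for f, row in pav.items() if not all(row.values()))
    return core, dispensable


individuals = ["A", "B", "C"]
assignments = [
    ("A", "A_g1", "fam1"), ("B", "B_g1", "fam1"), ("C", "C_g1", "fam1"),
    ("A", "A_g2", "fam2"), ("B", "B_g7", "fam2"),
    ("C", "C_g9", "fam3"),
]
pav = family_pav(assignments, individuals)
core, dispensable = split_core_dispensable(pav)
print(core)         # ['fam1']
print(dispensable)  # ['fam2', 'fam3']
```

Note that a family is recorded as absent whenever no gene of that family was assembled
and annotated in an individual, which is exactly why this strategy depends so heavily
on assembly completeness.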

Several challenges hinder applying a homolog-based strategy to eukaryotic
genomes. First, eukaryotic genomes are usually large, ranging from hundreds of millions
of bases to billions of bases, and contain a high proportion of repetitive sequences,
making whole-genome assembly challenging. Several approaches can help improve
the assembly, including increasing the sequencing depth, sequencing multiple
libraries with diverse insert sizes, and integrating long-read sequencing technologies
(Rhoads and Au 2015; Schneider and Dekker 2012). However, all of these
approaches significantly increase the cost of whole-genome assembly, thus limiting
the number of individuals involved in a study. Second, eukaryotes have split gene
structures, so automatic gene annotation may be inaccurate and lead to biased results.
Despite these challenges, several studies have followed the homolog-based
strategy. Li et al. sequenced seven wild soybean genomes using Illumina technology,
each with three libraries (insert sizes of 180 bp, 500 bp, and 2000 bp)
(Li et al. 2014). The average overall sequencing depth was 112x. Based on these data,
they were able to assemble ~89% of the genome. More recently, Zhao et al. sequenced
66 rice and wild rice accessions, each with two libraries (insert sizes of 400 bp
and 700 bp) (Zhao et al. 2018). The average sequencing depth reached 115x, and
they were able to assemble ~85% of the genomes. Notably, a significant portion of
each individual genome was not assembled in either study, and the associated genes were
labeled as “absent” in the corresponding individuals. However, given that these false
negatives repeatedly affect the same genes within repeat-rich regions, they can be
treated as systematic errors, and the overall results may still be meaningful.

Reference-based genomic studies are prevalent in eukaryotes. Researchers have
made tremendous efforts to build more complete reference genomes and to
provide confident gene annotations for important species. These reference genomes
and their annotated genes are the basis for modern genomics studies. Moreover,
reference-based genomic variants show great power in explaining phenotypic
variation when used as markers for genome-wide association analyses. Therefore,
when introducing the pangenome concept to eukaryotic genomic analyses, taking
advantage of a pre-existing well-annotated reference genome is a straightforward
choice. Following this idea, the “map-to-pan” strategy became prevalent for eukaryotic
pangenome studies, especially when the target genome is extremely large or the
study involves a large number of individuals (Fig. 2). The “map-to-pan” strategy
includes two main steps: construction of pangenome sequences by combining the
reference genome and non-reference representative (NRR) sequences (upper panel of
Fig. 2), and determination of the presence/absence of each gene in each individual by
mapping reads to the pangenome and examining the gene coverage (lower panel of
Fig. 2).

Several approaches for detecting NRR sequences have been reported (Wang et al.
2018; Ou et al. 2018; Montenegro et al. 2017; Yao et al. 2015; Read et al. 2013; Li
et al. 2014). Yao et al. utilized a metagenome-like assembly of mixed unaligned
reads from 1483 rice accessions with extremely low sequencing depth (1~3x) (Yao
et al. 2015) (Option 1 in Fig. 2), enabling the detection of ~9000 non-reference
genes. This approach assembles NRR sequences from heterogeneous reads pooled across
individuals and may generate chimeric contigs, especially considering that
non-reference sequences may contain higher levels of repetitive sequences. A variant
of this option (Option 2 in Fig. 2) is to assemble the unaligned reads from each
individual separately and retrieve NRR sequences using DNA homology clustering tools,
such as CD-HIT-EST (Fu et al. 2012), UCLUST (Edgar 2010), or MeShClust (James et al.
2018). Golicz et al. utilized an iterative assembly approach (Option 3 in Fig. 2),
repeating the following three steps: mapping the reads to a pseudo pangenome
(starting with the reference genome); assembling the unmapped reads; and updating
the pseudo pangenome with the newly assembled sequences (Golicz et al. 2016). They
demonstrated that the sizes of the final assemblies were similar regardless of the
order in which individuals were added to the iterative process. However, an improper
ordering may lead to fragmented assemblies. Alternatively, Hu et al. proposed an
integrated approach, implemented in the EUPAN toolkit (Hu et al. 2017) (Option 4 in
Fig. 2): (1) independent assembly of individual genomes; (2) generation of NRR
sequences by homology clustering of all unaligned contigs. This approach has the
benefit of avoiding chimeric sequences while retaining better sequence completeness.
It has recently been applied to hundreds of rice genomes (Wang et al. 2018; Hu et al.
2018; Sun et al. 2017) and to the 383 Capsicum genomes (Ou et al. 2018). It will
perform better than Option 2 in the scenario where a novel sequence contains a short
reference segment (likely a repetitive sequence) in the middle; Option 2 would
instead assemble two separate fragments. However, whole-genome assembly is
computationally intensive, hindering its application to extremely large genomes. In
summary, pooling of low-depth sequenced genomes may also contribute to pangenome
construction (Option 1). Options 2-4 are preferable if sequencing depth is high enough
for independent assemblies. Options 2-3 are especially useful for eukaryotes with
very large genomes (e.g., bread wheat, with a haploid genome of >13 Gb).

Fig. 2 “Map-to-pan” strategy for pangenome analyses. This strategy is mostly used for eukaryotic
pangenome analyses. It includes two main steps: (1) construction of pangenome sequences by
integrating a reference genome and assembled non-reference sequences; (2) determination of
presence/absence of each gene (both reference genes and non-reference predicted genes) based
on read coverage. Four strategies for obtaining non-reference representative sequences are
introduced

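
The redundancy-removal step shared by Options 2 and 4 can be illustrated with a toy
greedy procedure: contigs are visited from longest to shortest, and a contig is kept
as a representative only if it is not already largely contained in a longer one. This
is a simplified stand-in, not the algorithm of CD-HIT-EST, UCLUST, or MeShClust; the
k-mer containment measure, the values of `k` and `min_containment`, and all names are
assumptions chosen for illustration.

```python
def kmers(seq, k):
    """Return the set of k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def representative_contigs(contigs, k=5, min_containment=0.8):
    """Greedy redundancy removal for unaligned contigs.

    contigs: dict name -> DNA string (contigs that did not align to the reference).
    A contig is dropped when at least min_containment of its k-mers already
    occur in a longer contig kept as a representative. Reverse complements
    are ignored here for brevity.
    """
    reps = {}  # name -> k-mer set of each representative kept so far
    # Longest first (ties by name), so shorter fragments fold into longer contigs.
    for name in sorted(contigs, key=lambda n: (-len(contigs[n]), n)):
        ks = kmers(contigs[name], k)
        if not ks:
            continue  # shorter than k: too short to judge
        redundant = any(len(ks & rep) / len(ks) >= min_containment
                        for rep in reps.values())
        if not redundant:
            reps[name] = ks
    return list(reps)


unaligned = {
    "ind1_ctg1": "ACGTTGCAAGGCTTACGGAT",
    "ind2_ctg4": "GTTGCAAGGCTTACG",       # fully contained in ind1_ctg1
    "ind3_ctg2": "TTTTCCCCGGGGAAAATTTT",
}
print(representative_contigs(unaligned))  # ['ind1_ctg1', 'ind3_ctg2']
```

Production tools rely on alignment-based identity and far more efficient indexing, so
the sketch should be read only as a picture of what "removing redundant sequences from
the pool of unaligned contigs" means.
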
After the construction of pangenome sequences, gene presence/absence can be
determined by examining gene coverage when raw reads are mapped to the
pangenome (lower panel of Fig. 2). Remarkably, very different thresholds have
been applied to determine a gene's presence. For example, Wang et al. called a gene
present when CDS coverage (>1 read) was over 0.95 and gene body coverage was over
0.85 (Wang et al. 2018); Ou et al. required CDS coverage (>1 read) over 0.6 and gene
body coverage over 0.5 (Ou et al. 2018); Read et al. required gene body coverage
(>1 read) over 0.5 (Read et al. 2013); and Montenegro et al. (2017) and Golicz et al.
(2016) used a threshold of exon coverage over 0.05. Unfortunately, such divergent
thresholds make quantitative cross-species comparisons of gene PAV-related features
meaningless. Theoretically, with a high-enough sequencing depth, a gene that is
present, or at least its CDS, should be fully covered. Loss of part of the sequence
of a gene, defined as a “functional unit,” may cause a loss of gene function. Setting
gene body coverage cutoffs will help distinguish retro-transcribed pseudogenes from
their parental genes. In reality, certain genomic regions may not be covered because
of both insufficient sequencing depth and uneven sequencing. One plausible
solution is to lower the thresholds. However, differences in sequencing depth may
then lead to inconsistent sensitivity of gene presence determination among
individuals; individuals with higher sequencing depth would appear to contain more
genes. Another possible solution is to study the presence/absence of gene families
instead of genes, by calculating “gene presence” using a low threshold and determining
gene family presence based on “gene presence.” In this scenario, the unbalanced
sequencing depths also need to be corrected, either by downsampling to equal depths or
by setting dynamic thresholds based on the sequencing depth. Nevertheless, it is not
recommended to determine gene presence/absence from low-depth sequencing
data. Gene presence/absence should only be studied and compared for individuals
with sufficient sequencing data, that is, when coverage of the genome is saturated
upon mapping to the pangenome. For example, Wang et al. mapped
raw reads of ~3000 rice accessions to the reference genome and found that genome
coverage is stable when sequencing depth exceeds 20x; therefore, gene presence/
absence was only studied for a selected set of 453 accessions with sequencing
depth >20x (Wang et al. 2018).
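
The effect of the chosen cutoffs can be seen in a small sketch that calls presence
from per-base read depth. It is an illustration under stated assumptions, not the
implementation used in any of the cited studies: the function names, the half-open
coordinate convention, and the `min_reads` parameter (how many reads make a base count
as covered) are all hypothetical, and "over" a cutoff is interpreted here as greater
than or equal to it.

```python
def covered_fraction(depth, intervals, min_reads=1):
    """Fraction of bases in intervals covered by at least min_reads reads.

    depth: list of per-base read depths along one sequence (0-based).
    intervals: list of (start, end) half-open coordinate pairs.
    """
    total = sum(end - start for start, end in intervals)
    if total == 0:
        return 0.0
    hit = sum(1 for start, end in intervals
              for pos in range(start, end) if depth[pos] >= min_reads)
    return hit / total


def gene_present(depth, gene_span, cds_intervals, min_cds=0.95, min_body=0.85):
    """Call presence when both CDS and gene-body coverage reach their cutoffs."""
    cds_cov = covered_fraction(depth, cds_intervals)
    body_cov = covered_fraction(depth, [gene_span])
    return cds_cov >= min_cds and body_cov >= min_body


gene_span = (0, 20)                 # toy gene of 20 bases
cds = [(2, 8), (12, 18)]            # two coding exons
depth_a = [3] * 10 + [0] * 2 + [2] * 8   # only 2 intronic bases uncovered
depth_b = [1] * 7 + [0] * 5 + [1] * 8    # a gap reaching into the first exon

print(gene_present(depth_a, gene_span, cds))                             # True
print(gene_present(depth_b, gene_span, cds))                             # False
print(gene_present(depth_b, gene_span, cds, min_cds=0.6, min_body=0.5))  # True
```

The second individual is called absent with the stricter cutoffs (0.95/0.85) and
present with the more lenient ones (0.6/0.5), which is the kind of discrepancy that
makes results obtained with different thresholds hard to compare.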

The “map-to-pan” strategy also exhibits better accuracy. A pangenome study can 
be technically evaluated at two levels: (1) the accuracy of pangenome (gene anno- 
tation and gene completeness) and (2) the accuracy of gene presence/absence calling. 
The “map-to-pan” strategy utilizes reference sequences and their annotations 
directly. Strategies using a whole-genome assembly (homolog-based, and option 
4 of the “map-to-pan” strategy) will have a higher possibility of detecting complete 
gene sequences. At the gene presence/absence level, the homolog-based strategy has 
a bottleneck in assembling a complete genome, and “map-to-pan” strategies defi- 
nitely show better accuracy when sequencing depth is high enough (Hu et al. 2017). 

After determination of gene presence/absence, analyses similar to those in bacterial
pangenome studies can be performed for eukaryotes, including but not limited to
(1) simulating the pangenome and core-genome sizes; (2) constructing phylogenetic
relationships based on gene presence/absence; and (3) exploring functions related to
the dispensable genome or to a specific dispensable gene.
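
Analysis (1) is commonly done by adding individuals in random order and recording how
the pangenome grows and the core-genome shrinks. The sketch below averages these
curves over random orderings; it assumes a presence/absence result stored as one gene
set per individual, and the function name `size_curves`, its parameters, and the toy
data are hypothetical.

```python
import random


def size_curves(pav, n_permutations=100, seed=0):
    """Average pangenome and core-genome sizes as individuals are added.

    pav: dict individual -> set of genes (or gene families) called present.
    Returns two lists (pan, core); element i is the mean size over random
    orderings after i + 1 individuals have been added.
    """
    rng = random.Random(seed)
    names = sorted(pav)
    n = len(names)
    pan_sum = [0] * n
    core_sum = [0] * n
    for _ in range(n_permutations):
        order = names[:]
        rng.shuffle(order)
        pan, core = set(), None
        for i, name in enumerate(order):
            pan |= pav[name]  # union: every gene seen so far
            # intersection: genes shared by all individuals added so far
            core = set(pav[name]) if core is None else core & pav[name]
            pan_sum[i] += len(pan)
            core_sum[i] += len(core)
    return ([s / n_permutations for s in pan_sum],
            [s / n_permutations for s in core_sum])


toy_pav = {
    "A": {"c1", "c2", "d1"},
    "B": {"c1", "c2", "d1", "d2"},
    "C": {"c1", "c2", "d3"},
    "D": {"c1", "c2", "d2", "d4"},
}
pan_curve, core_curve = size_curves(toy_pav, n_permutations=50, seed=1)
print(pan_curve)
print(core_curve)
```

Whatever the ordering, the last point of each curve equals the full pangenome size
(6 genes here) and the core-genome size (2 genes). Whether the pangenome curve
plateaus or keeps rising is what distinguishes a closed from an open pangenome.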


2 Future Directions 


In summary, the pangenome is an important property of any eukaryotic species/
population, and gene PAVs represent a very important dimension of within-species/
population diversity that remains uncharacterized in most eukaryotic species. As the
cost of genome sequencing decreases, one would expect pangenome analyses to
be carried out in more and more species, first in the most important and/or model
plant and animal species, and then in natural populations of wild species. Thus,
eukaryotic pangenome research in the next several years should focus on revealing
within-species/population gene PAVs and building the pan-references for species of
interest. The pan-reference of a species should represent (1) all the
sequences within the species, (2) the connections of alternative sequence segments,
and (3) the genotype likelihoods (allele frequencies), such that all possible
mechanisms (SVs and distribution/activities of transposable elements) potentially
responsible for pangenome expansion and generation of gene PAVs can be clearly
represented and understood. As pangenomes and gene PAVs are revealed in more
and more plant and animal species, eukaryotic pangenome research will
naturally be extended to comparative pangenome analyses, focusing on comparisons
of the pangenome constitution between or among related species. Results from
this kind of research are expected to provide new insights into the evolutionary
history of eukaryotic species. For example, comparing related species, or
different populations of the same species, in terms of the proportions of core and
dispensable genes/gene families in their pangenomes and the patterns by which new
genes emerged will provide important information on their evolutionary history.
Presumably, the emergence of new species would be accompanied by bursts of new
gene emergence, and major extinctions by massive gene losses.
It would also be of great interest to compare the core-genome constitution between
related species and to compare the dispensable genome constitution between different
populations of the same species. In the former case, one may see the differences
in key genes and their functionalities between related species. In the latter case, one
may discover important sets of genes contributing to adaptation to specific
environments, which are important for future plant and animal improvement. In this
respect, genome-wide association analyses of important traits based on pangenome SNPs
or on gene PAVs should be widely adopted (Hu et al. 2018).

As more eukaryotic pangenome analyses are expected to emerge, the technical
strategy and methodology for analyzing eukaryotic pangenomes need to be
improved. Because of the relatively high sequencing and analytic costs of
eukaryotic pangenomes, NGS technology will remain the primary technology for
pangenome studies of most eukaryotes in the short term, particularly for
species with very large genomes, and so will the “map-to-pan” strategy elaborated in
detail here. However, before applying this strategy, particular attention should be
paid to the sampling strategy, to make sure that a minimum number of representative
individuals of the target species or population is used, and to the selection and
evaluation of the parameters of the map-to-pan methodology. For the presentation and
storage of results from eukaryotic pangenome analyses, graph-based data structures are
highly desirable and should be widely used in pan-reference storage and visualization
(Zekic et al. 2018; Marschall et al. 2018; Baier et al. 2016). Pioneering work has
been done in human genome research, where the NRR sequences might be of a
small size. Alternative sequences of highly variable regions were added to the human
reference genome, starting with GRCh37 (Church et al. 2011), and were anchored to
locations along the primary assembly. Besides the
limited NRR sequences, a large number of SNPs, InDels, and SVs (deletions,
duplications, and translocations) can also be integrated into the pan-reference
(Zekic et al. 2018; Marschall et al. 2018; Baier et al. 2016). Moreover, read
alignment tools and variant-calling tools that work on the graph-based pan-reference
will be required. However, for plant species with high within-species sequence
diversity, the challenge is how to anchor large numbers of NRR sequences, whose size
may be as large as half of the reference genome. Finally, considering that the
prediction of “new” or novel genes based on simple thresholds of sequence homology,
without detailed information on gene functionality, is always somewhat arbitrary,
pangenome results based on NGS technology can be validated and improved
greatly if high-quality reference genomes of a relatively small number of
representative individuals are included in a pangenome study, particularly for
important model species with relatively small genomes.
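
As an illustration of what a graph-based pan-reference has to hold, the sketch below
stores sequence segments, the connections between them, and one path per sample, from
which the frequency of an alternative segment can be read off. It is a toy data
structure written for this purpose, not the format or API of any existing graph
genome tool; the class `PanGraph` and its method names are hypothetical.

```python
from collections import defaultdict


class PanGraph:
    """Toy sequence graph: segments, links between them, and sample paths."""

    def __init__(self):
        self.segments = {}             # segment id -> DNA string
        self.links = defaultdict(set)  # segment id -> ids of segments that follow it
        self.paths = {}                # sample name -> ordered list of segment ids

    def add_segment(self, seg_id, sequence):
        self.segments[seg_id] = sequence

    def add_path(self, sample, seg_ids):
        """Register a sample's walk; links are derived from consecutive segments."""
        path = list(seg_ids)
        for seg_id in path:
            if seg_id not in self.segments:
                raise KeyError(f"unknown segment: {seg_id}")
        self.paths[sample] = path
        for a, b in zip(path, path[1:]):
            self.links[a].add(b)

    def sequence(self, sample):
        """Spell out the sequence carried by one sample."""
        return "".join(self.segments[s] for s in self.paths[sample])

    def frequency(self, seg_id):
        """Fraction of sample paths that traverse the segment."""
        if not self.paths:
            return 0.0
        carrying = sum(1 for p in self.paths.values() if seg_id in p)
        return carrying / len(self.paths)


g = PanGraph()
g.add_segment("s1", "ACGT")
g.add_segment("s2", "TTGACC")   # non-reference insertion
g.add_segment("s3", "GGA")
g.add_path("ref", ["s1", "s3"])
g.add_path("ind1", ["s1", "s2", "s3"])
g.add_path("ind2", ["s1", "s3"])
g.add_path("ind3", ["s1", "s2", "s3"])

print(g.sequence("ind1"))      # ACGTTTGACCGGA
print(g.frequency("s2"))       # 0.5
print(sorted(g.links["s1"]))   # ['s2', 's3']
```

Because the inserted segment is anchored between two reference segments, its position
and its frequency among samples are explicit. Anchoring becomes the hard part when
NRR sequences are numerous and their flanking sequences are unknown.
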